I'm trying to parse an HTML page with a lot of tables. I searched the net for how to parse HTML with Objective-C and found hpple. I then looked for a tutorial, which led me to:
http://www.raywenderlich.com/14172/how-to-parse-html-on-ios
With this tutorial I tried to parse some forum news which has a lot of tables from this site (Hebrew): news forum
I tried to parse the news titles, but I don't know what to write in my code. Every time I run my XPath query I get "Nodes was nil."
The code of my latest attempt is:
NSURL *contributorsUrl = [NSURL URLWithString:@"http://rotter.net/cgi-bin/listforum.pl"];
NSData *contributorsHtmlData = [NSData dataWithContentsOfURL:contributorsUrl];
// 2
TFHpple *contributorsParser = [TFHpple hppleWithHTMLData:contributorsHtmlData];
// 3
NSString *contributorsXpathQueryString = @"//body/div/center/center/table[@cellspacing=0]/tbody/tr/td/table[@cellspacing=1]/tbody/tr[@bgcolor='#FDFDFD']/td[@align='right']/font[@class='text15bn']/font[@face='Arial']/a/b";
NSArray *contributorsNodes = [contributorsParser searchWithXPathQuery:contributorsXpathQueryString];
// 4
NSMutableArray *newContributors = [[NSMutableArray alloc] initWithCapacity:0];
for (TFHppleElement *element in contributorsNodes) {
    // 5
    Contributor *contributor = [[Contributor alloc] init];
    [newContributors addObject:contributor];
    // 6
}
Could somebody guide me through to getting the titles?
Not sure if that's an option for you, but if the desired table has a unique id you could use a messy approach: load the HTML into a UIWebView and get the contents via -stringByEvaluatingJavaScriptFromString:, like this:
// desired table container's id is "msg"
NSString *value = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementById('msg').innerHTML"];
I have a set of HTML code, here:
<div id="content_text">
<p>Year 11 students will be making their course selections online this year.
</p>
<p>Information about this system has been made available through Tutor sessions. Each student will have an individual password. Once subject selections have been made students are to print out a copy of their choices and then have this form signed by themselves, their parent and their Tutor. Forms are to be completed by 22 August. Course books can be borrowed from the Library or are available online.
Now my problem is that this is fed from an RSS feed article web page, and there may be 1 or even 11 <p> tags within this one <div id="content_text">. How can I fetch all of the <p> tags in this div and display them formatted in a UITextField?
I am currently using XPathQuery, so my parse looks like this:
NSData *tutorialsHtmlDataTwo = [NSData dataWithContentsOfURL:[NSURL URLWithString:_storyLink]];
TFHpple *tutorialsParserTwo = [TFHpple hppleWithHTMLData:tutorialsHtmlDataTwo];
NSString *tutorialsXpathQueryStringTwo = @"//div[@id='content_text']/p";
NSArray *tutorialsNodesTwo = [tutorialsParserTwo searchWithXPathQuery:tutorialsXpathQueryStringTwo];
NSMutableArray *newTutorialsTwo = [[NSMutableArray alloc] initWithCapacity:0];
for (TFHppleElement *element in tutorialsNodesTwo) {
    Tutorial *tutorialTwo = [[Tutorial alloc] init];
    [newTutorialsTwo addObject:tutorialTwo];
    tutorialTwo.title = [[element firstChild] content];
    _rssBody.text = [NSString stringWithFormat:@"%@", [[element firstChild] content]];
}
So as you can see it will only parse the second line. Any help appreciated.
Thanks, SebOH.
Please use this query to find all the elements inside the given element:
//div[@id='content_text']
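Note that the loop in the question assigns _rssBody.text on every pass, so only the last paragraph it visits survives. One way around that is to collect the paragraph strings first and set the text once. A minimal sketch, assuming the strings have already been read from the hpple nodes; the helper name JoinedParagraphs is hypothetical and not part of hpple:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: joins paragraph strings into one block of text,
// trimming each paragraph and skipping any that end up empty.
static NSString *JoinedParagraphs(NSArray *paragraphs) {
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *cleaned = [NSMutableArray array];
    for (NSString *paragraph in paragraphs) {
        NSString *trimmed = [paragraph stringByTrimmingCharactersInSet:whitespace];
        if (trimmed.length > 0) {
            [cleaned addObject:trimmed];
        }
    }
    // A blank line between paragraphs keeps them visually separate.
    return [cleaned componentsJoinedByString:@"\n\n"];
}
```

Inside the for loop you would add [[element firstChild] content] to a mutable array (checking for nil first), and after the loop set _rssBody.text = JoinedParagraphs(thatArray) a single time.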
I am using the following code to convert text to pdf form:
NSString *filePath = [[NSBundle mainBundle] pathForResource:@"All_lang_unicode" ofType:@"txt"];
NSString *str;
NSData *myData = [NSData dataWithContentsOfFile:filePath];
if (myData) {
    str = [[NSString alloc] initWithData:myData encoding:NSUTF16StringEncoding];
    NSLog(@"STRING : %@",str);
}
NSString *html = [NSString stringWithFormat:@"<body>%@</body>",str];
UIMarkupTextPrintFormatter *fmt = [[UIMarkupTextPrintFormatter alloc]
initWithMarkupText:html];
UIPrintPageRenderer *render = [[UIPrintPageRenderer alloc] init];
[render addPrintFormatter:fmt startingAtPageAtIndex:0];
CGRect page;
page.origin.x=0;
page.origin.y=0;
page.size.width=792;
page.size.height=612;
CGRect printable=CGRectInset( page, 0, 0 );
[render setValue:[NSValue valueWithCGRect:page] forKey:@"paperRect"];
[render setValue:[NSValue valueWithCGRect:printable] forKey:@"printableRect"];
NSLog(@"number of pages %ld", (long)[render numberOfPages]);
NSMutableData * pdfData = [NSMutableData data];
UIGraphicsBeginPDFContextToData( pdfData, CGRectZero, nil );
for (NSInteger i=0; i < [render numberOfPages]; i++)
{
UIGraphicsBeginPDFPage();
CGRect bounds = UIGraphicsGetPDFContextBounds();
[render drawPageAtIndex:i inRect:bounds];
}
UIGraphicsEndPDFContext();
NSArray *paths = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES);
NSString *documentsDirectory = [paths objectAtIndex:0];
NSString *pdfFile = [documentsDirectory stringByAppendingPathComponent:@"test.pdf"];
[pdfData writeToFile:pdfFile atomically:YES];
But the problem is that I am not getting the proper formatting of the text. When I print the string using NSLog() I get the proper content, but once I place the string in the PDF the spacing and newlines are missing; everything comes out on one continuous line.
(UPDATE : )
NSLog OUTPUT:(Proper)
NEW DELHI: Sachin Tendulkar's streak of low scores might have raised a question mark over his future but senior BCCI official and IPL chairman Rajiv Shukla on Monday came out in support of the senior batsman saying one needs to look at his "colossal record" before making any comment.
"He will hang up his boots when he thinks it's time for him to go. He does not need any advice on this. Before making a comment on his performance you have to see his colossal record and his past performance," Shukla told reporters outside the Parliament adding that the veteran cricketer will come back strongly in the forthcoming matches.
and what I'm getting is:
NEW DELHI: Sachin Tendulkar's streak of low scores might have raised a question mark over his future but senior BCCI official and IPL chairman Rajiv Shukla on Monday came out in support of the senior batsman saying one needs to look at his "colossal record" before making any comment. "He will hang up his boots when he thinks it's time for him to go. He does not need any advice on this. Before making a comment on his performance you have to see his colossal record and his past performance," Shukla told reporters outside the Parliament adding that the veteran cricketer will come back strongly in the forthcoming matches.
Can anyone please suggest a modification to this code so that I get the proper formatting?
If I get it right, you should replace your newline characters with <br> or <p>.
Try
str = [str stringByReplacingOccurrencesOfString:@"\n" withString:@"<br>"];
How to detect new lines in Objective-C
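The one-line replacement above only handles "\n" and leaves characters such as "&" or "<" in the text untouched, which can confuse an HTML formatter. A slightly fuller sketch, assuming the text may contain Windows or old Mac line endings as well; the helper name HTMLFromPlainText is hypothetical:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: escapes HTML-special characters, then turns every
// line break (\r\n, \r or \n) into a <br> tag.
static NSString *HTMLFromPlainText(NSString *text) {
    NSMutableString *html = [text mutableCopy];
    // Order matters: "&" is escaped first so later entities are not re-escaped,
    // and line endings are normalised to "\n" before being replaced by <br>.
    NSArray *replacements = @[ @[@"&", @"&amp;"], @[@"<", @"&lt;"], @[@">", @"&gt;"],
                               @[@"\r\n", @"\n"], @[@"\r", @"\n"], @[@"\n", @"<br>"] ];
    for (NSArray *pair in replacements) {
        [html replaceOccurrencesOfString:pair[0]
                              withString:pair[1]
                                 options:NSLiteralSearch
                                   range:NSMakeRange(0, html.length)];
    }
    return html;
}
```

With this in place, the markup handed to the print formatter could be built as [NSString stringWithFormat:@"<body>%@</body>", HTMLFromPlainText(str)].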
A solution to your follow-up question might look like this:
NSArray *words = [str componentsSeparatedByString:@" "];
NSString *line = @"";
NSUInteger maxLineLength = 100;
NSString *resultStr = @"";
for (NSString *word in words) {
    if ([line length] == 0) {
        // First word on a line: no leading space.
        line = word;
    } else if ([line length] + 1 + [word length] > maxLineLength) {
        // The word does not fit: emit the current line and start a new one.
        resultStr = [resultStr stringByAppendingFormat:@"%@<br>", line];
        line = word;
    } else {
        line = [line stringByAppendingFormat:@" %@", word];
    }
}
resultStr = [resultStr stringByAppendingString:line];
Basically I need to parse td (table data) elements from this HTML file, and I need to get the right XPath. I am using the raywenderlich tutorial as a model for this task, and here is the code I have so far.
NSURL *tutorialsUrl = [NSURL URLWithString:@"http://example.com/events"];
NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];
// 2
TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];
// 3
NSString *tutorialsXpathQueryString = @"This is where I need to enter my xpath to retrieve the table data";
NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];
I have the HTML path to this element thanks to Firebug, which I will post below.
/<html lang="en">/<body>/div id="page" class="container">/<div class="span-19">/<div id="content">/<div>/<table id=yw0 class="detail-view">/<tbody>/<tr class="even">/<td>moo</td>/
I need the text moo to be parsed. Any help will be deeply appreciated.
This is the XPath I get from Firebug as well, but it didn't work at all.
/html/body/div/div[4]/div/div/table/tbody/tr[2]/td
First, split the text into substrings, each of which contains one element that needs to be extracted:
NSArray *split = [text componentsSeparatedByString:@"<td>"];
The first object in the "split" array contains nothing you want, so you can ignore it from here on. Now, in each remaining substring of this array, search for the closing "</td>" tag:
NSRange range = [string rangeOfString:@"</td>"];
and then remove it and everything after it:
- (NSString *)substringToIndex:(NSUInteger)anIndex //you will get index by searching for "</td>" as mentioned
EDIT:
Another possibility is to use componentsSeparatedByString: again, in place of the 2nd and 3rd steps, splitting on the closing tag; the first item of each resulting array is then the text you want.
EDIT2: (whole code)
NSString *originalText = @" /<html lang=\"en\">/<body>/div id=\"page\" class=\"container\">/<div class=\"span-19\">/<div id=\"content\">/<div>/<table id=yw0 class=\"detail-view\">/<tbody>/<tr class=\"even\">/<td>moo1</td><td>moo2</td>/";
NSArray *separatedParts = [originalText componentsSeparatedByString:@"<td>"];
NSMutableArray *arrayOfResults = [[NSMutableArray alloc] init];
// Skip index 0: it holds everything before the first <td>.
for (NSUInteger i = 1; i < separatedParts.count; i++) {
    NSString *part = [separatedParts objectAtIndex:i];
    NSRange range = [part rangeOfString:@"</td>"];
    if (range.location == NSNotFound) {
        continue; // no closing tag in this part, so there is nothing to cut
    }
    [arrayOfResults addObject:[part substringToIndex:range.location]];
}
I have slightly altered the original text to show that it really works for a table with more than one item inside.
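As an alternative to string splitting, the XPath itself may be the culprit: Firebug shows the browser's DOM, in which a <tbody> is inserted even when the page source has none, while hpple parses the source as it is. A hedged sketch that avoids depending on <tbody> or on element positions, assuming the table really carries id "yw0" in the page source; the helper name CellTextsInTable is hypothetical, and only hpple calls that already appear in the question are used:

```objective-c
#import <Foundation/Foundation.h>
#import "TFHpple.h"         // third-party hpple library, as used in the question
#import "TFHppleElement.h"

// Hypothetical helper: returns the text of every <td> inside the table
// that has the given id attribute.
static NSArray *CellTextsInTable(NSData *htmlData, NSString *tableID) {
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];
    // "//td" matches cells at any depth below the table, so the query
    // works whether or not the source contains a <tbody> element.
    NSString *query = [NSString stringWithFormat:@"//table[@id='%@']//td", tableID];
    NSMutableArray *texts = [NSMutableArray array];
    for (TFHppleElement *cell in [parser searchWithXPathQuery:query]) {
        NSString *text = [[cell firstChild] content];
        if (text != nil) {
            [texts addObject:text];
        }
    }
    return texts;
}
```

Calling CellTextsInTable(tutorialsHtmlData, @"yw0") would then be expected to return the cell strings, "moo" among them, if the assumptions about the page hold.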
The code below takes all of the text from a certain div. Is it possible for me to take all the text from the div together with its HTML tags, so that all of the <p></p> and <br> tags are also added to the string, mystring?
//trims string from previous page
NSString *trimmedString = [stringy stringByTrimmingCharactersInSet:
[NSCharacterSet whitespaceAndNewlineCharacterSet]];
NSData *data = [[NSString stringWithContentsOfURL:[NSURL URLWithString:trimmedString]] dataUsingEncoding:NSUTF8StringEncoding];
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];
NSArray *elements = [xpathParser searchWithXPathQuery:@"//div[@class='field-item even']"];
TFHppleElement *element = [elements lastObject]; //may need to change this number?!
NSString *mystring = [self getStringForTFHppleElement:element];
trimmedTextView.text = [trimmedTextView.text stringByAppendingString:mystring];
Method here:
- (NSString *)getStringForTFHppleElement:(TFHppleElement *)element
{
    NSMutableString *result = [NSMutableString new];
    // Iterate recursively through all children
    for (TFHppleElement *child in [element children])
        [result appendString:[self getStringForTFHppleElement:child]];
    // Hpple creates a <text> node when it parses texts
    if ([element.tagName isEqualToString:@"text"])
        [result appendString:element.content];
    return result;
}
Any ideas would be appreciated. Cheers.
Try this:
NSString *htmlDataString = [webView stringByEvaluatingJavaScriptFromString:@"document.documentElement.outerHTML"];
This returns the whole page's HTML as a string. You can then parse it in your native code and find the div you are interested in, as you did in the example above.
You can do the same with any DOM element in your HTML, for example:
NSString *htmlDataString = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementById('mydiv').outerHTML"];
which is more efficient but requires a bit of JavaScript skill.
Rather than converting HTML to an attributed string, I need to convert an attributed string back to HTML. This can easily be done on Mac, as can be seen here: http://www.justria.com/2011/01/18/how-to-convert-nsattributedstring-to-html-markup/
Unfortunately, the method dataFromRange:documentAttributes: is only available on Mac via the NSAttributedString AppKit Additions.
My question is how can you do this on iOS?
Not the 'easy' way, but what about iterating through the attributes of the string using:
- (void)enumerateAttributesInRange:(NSRange)enumerationRange
options:(NSAttributedStringEnumerationOptions)opts
usingBlock:(void (^)(NSDictionary *attrs, NSRange range, BOOL *stop))block
Have an NSMutableString variable to accumulate the HTML (let's call it 'html'). In the block, you would construct the HTML manually using strings. For instance, if the text attributes 'attrs' specify red, bold text:
[html appendFormat:@"<span style='color:red; font-weight: bold;'>%@</span>", [originalStr substringWithRange:range]];
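Put together, the enumeration approach could look roughly like the sketch below. It is only an illustration under stated assumptions: it looks at the font attribute alone and maps bold and italic traits to CSS, ignoring colours, links and everything else; the function names EscapeHTML and HTMLFromAttributedString are hypothetical.

```objective-c
#import <UIKit/UIKit.h>

// Hypothetical helper: escapes the characters that are special in HTML text.
static NSString *EscapeHTML(NSString *text) {
    NSMutableString *escaped = [text mutableCopy];
    for (NSArray *pair in @[ @[@"&", @"&amp;"], @[@"<", @"&lt;"], @[@">", @"&gt;"] ]) {
        [escaped replaceOccurrencesOfString:pair[0] withString:pair[1]
                                    options:NSLiteralSearch
                                      range:NSMakeRange(0, escaped.length)];
    }
    return escaped;
}

// Hypothetical helper: walks the attribute runs and wraps bold or italic
// runs in a <span>; runs without those traits are emitted as plain text.
static NSString *HTMLFromAttributedString(NSAttributedString *source) {
    NSMutableString *html = [NSMutableString string];
    [source enumerateAttributesInRange:NSMakeRange(0, source.length)
                               options:0
                            usingBlock:^(NSDictionary *attrs, NSRange range, BOOL *stop) {
        NSString *text = EscapeHTML([source.string substringWithRange:range]);
        UIFont *font = attrs[NSFontAttributeName];
        // Messaging a nil font yields 0, i.e. no traits.
        UIFontDescriptorSymbolicTraits traits = font.fontDescriptor.symbolicTraits;
        NSMutableArray *styles = [NSMutableArray array];
        if (traits & UIFontDescriptorTraitBold) [styles addObject:@"font-weight: bold"];
        if (traits & UIFontDescriptorTraitItalic) [styles addObject:@"font-style: italic"];
        if (styles.count > 0) {
            [html appendFormat:@"<span style=\"%@\">%@</span>",
                               [styles componentsJoinedByString:@"; "], text];
        } else {
            [html appendString:text];
        }
    }];
    return html;
}
```

Other attributes (for example the foreground colour mentioned above) would be handled the same way, by adding one more entry to the styles array per attribute you care about.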
EDIT: Stumbled across this yesterday:
NSAttributedString+HTMLFromRange category from "UliKit"
(https://github.com/uliwitness/UliKit/blob/master/NSAttributedString+HTMLFromRange.m)
Looks like it will do what you want.
Use the code below; it works well on iOS 7 and later, where UIKit adds dataFromRange:documentAttributes:error: to NSAttributedString.
NSAttributedString *s = ...;
NSDictionary *documentAttributes = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType};
NSData *htmlData = [s dataFromRange:NSMakeRange(0, s.length) documentAttributes:documentAttributes error:NULL];
NSString *htmlString = [[NSString alloc] initWithData:htmlData encoding:NSUTF8StringEncoding];