Trying to pull table data out of HTML

Basically, I need to parse the td (table data) elements from this HTML file, and I need the right XPath to do it. I am using the raywenderlich tutorial as a model for this task, and here is the code I have so far.
NSURL *tutorialsUrl = [NSURL URLWithString:@"http://example.com/events"];
NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];
// 2
TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];
// 3
NSString *tutorialsXpathQueryString = @"This is where I need to enter my XPath to retrieve the table data";
NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];
I have the HTML path to this element thanks to Firebug, which I will post below.
/<html lang="en">/<body>/<div id="page" class="container">/<div class="span-19">/<div id="content">/<div>/<table id=yw0 class="detail-view">/<tbody>/<tr class="even">/<td>moo</td>/
I need the text moo to be parsed. Any help will be deeply appreciated.
This is the XPath I get from Firebug as well, but it didn't work at all.
/html/body/div/div[4]/div/div/table/tbody/tr[2]/td

First, you need to get substrings, each containing one element that needs to be extracted:
NSArray *split = [text componentsSeparatedByString:@"<td>"];
In the array "split", the first object contains nothing you want, so you can ignore it. Now, for each remaining substring in this array, search for the closing "</td>" tag:
NSRange range = [string rangeOfString:@"</td>"];
and then remove it and everything after it:
- (NSString *)substringToIndex:(NSUInteger)anIndex //you will get the index by searching for "</td>" as mentioned
EDIT:
Another possibility is to use componentsSeparatedByString: again, in place of the 2nd and 3rd steps, this time splitting on the closing tag; the first item of each resulting array is then the wanted text.
EDIT2: (whole code)
NSString *originalText = @" /<html lang=\"en\">/<body>/<div id=\"page\" class=\"container\">/<div class=\"span-19\">/<div id=\"content\">/<div>/<table id=yw0 class=\"detail-view\">/<tbody>/<tr class=\"even\">/<td>moo1</td><td>moo2</td>/";
NSArray *separatedParts = [originalText componentsSeparatedByString:@"<td>"];
NSMutableArray *arrayOfResults = [[NSMutableArray alloc] init];
// Start at 1: the first part is everything before the first <td>.
for (NSUInteger i = 1; i < separatedParts.count; i++) {
    NSString *part = [separatedParts objectAtIndex:i];
    NSRange range = [part rangeOfString:@"</td>"];
    // Skip parts with no closing tag; substringToIndex: would throw for NSNotFound.
    if (range.location == NSNotFound) continue;
    [arrayOfResults addObject:[part substringToIndex:range.location]];
}
I have slightly altered the original text to show that it really works for a table with more items inside.
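The alternative mentioned in the first EDIT (splitting on the closing tag as well, instead of searching for its range) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name CellTextsFromHTML is hypothetical, and the approach only matches a bare "<td>" with no attributes, so it is not a substitute for a real HTML parser.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (the name is illustrative, not part of any library):
// returns the text found between each "<td>" and the next "</td>".
static NSArray<NSString *> *CellTextsFromHTML(NSString *html) {
    NSArray<NSString *> *parts = [html componentsSeparatedByString:@"<td>"];
    NSMutableArray<NSString *> *results = [NSMutableArray array];
    // Index 0 holds everything before the first <td>, so skip it.
    for (NSUInteger i = 1; i < parts.count; i++) {
        NSArray<NSString *> *pieces = [parts[i] componentsSeparatedByString:@"</td>"];
        // Without a closing tag there is only one piece; treat that cell as malformed and skip it.
        if (pieces.count > 1) {
            [results addObject:pieces[0]];
        }
    }
    return results;
}
```

Calling CellTextsFromHTML on the sample string from EDIT2 should give an array containing moo1 and moo2.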

Related

Is this a legal/safe way to pull data from websites on iOS?

After playing around with a few different ways to pull website data I developed this simple and quick solution that appears to work well:
int zip = 13153;
int lowerBound = 10000;
int upperBound = 99999;
bool foundValidZip;
@implementation ViewController
- (void)viewDidLoad {
[super viewDidLoad];
while (foundValidZip == false) {
zip = lowerBound + arc4random() % (upperBound - lowerBound);
// Do any additional setup after loading the view, typically from a nib.
NSString *urString = [NSString stringWithFormat:@"http://www.zip-info.com/cgi-local/zipsrch.exe?zip=%i&Go=Go",zip];
NSURL *URL = [NSURL URLWithString:urString];
NSData *data = [NSData dataWithContentsOfURL:URL];
// Assuming data is in UTF8. NSData is not NUL-terminated, so decode it with an explicit length-aware initializer.
NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
NSLog(@"%@",html);
NSMutableArray *names = [self stringsBetweenString:@"</th></tr><tr><td align=center>" andString:@"</font></td>" andText:html];
NSMutableArray *states = [self stringsBetweenString:@"</font></td><td align=center>" andString:@"</font></td><td align=center>" andText:html];
if ([names count] > 0 && [states count] > 0) {
NSString *name = [names objectAtIndex:0];
NSString *state = [states objectAtIndex:0];
self.nameLabel.text = name;
self.stateLabel.text = state;
self.zipLabel.text = [NSString stringWithFormat:@"%i",zip];
foundValidZip = true;
}
else {
foundValidZip = false;
}
}
}
-(NSMutableArray*)stringsBetweenString:(NSString*)start andString:(NSString*)end andText:(NSString*)text {
NSMutableArray* strings = [NSMutableArray arrayWithCapacity:0];
NSRange startRange = [text rangeOfString:start];
for( ;; )
{
if (startRange.location != NSNotFound)
{
NSRange targetRange;
targetRange.location = startRange.location + startRange.length;
targetRange.length = [text length] - targetRange.location;
NSRange endRange = [text rangeOfString:end options:0 range:targetRange];
if (endRange.location != NSNotFound)
{
targetRange.length = endRange.location - targetRange.location;
[strings addObject:[text substringWithRange:targetRange]];
NSRange restOfString;
restOfString.location = endRange.location + endRange.length;
restOfString.length = [text length] - restOfString.location;
startRange = [text rangeOfString:start options:0 range:restOfString];
}
else
{
break;
}
}
else
{
break;
}
}
NSLog(@"%@",strings);
return strings;
}
Essentially, this queries a website that looks up the city associated with a ZIP code and fetches the HTML for a random ZIP code. The program then extracts specific bits of information from that HTML by searching for the text between a unique pair of start and end "caps". I've used this "cap" method for a few other sample applications. Some of these do not actually query a website, but fetch data from a static URL that is updated frequently. One of the only pitfalls I can see is that if the HTML changes, this may stop working. Other than that, it seems to work really well and is extremely quick. Before I publish any of my applications, I want to make sure that a large number of queries will not damage the websites or cause other problems for me or the webmaster. Is this OK to do? And is there a better alternative (not for this specific purpose, ZIP codes, but for pulling data in general)?
What you're doing is called scraping the web site / page. It's a general approach, but one that isn't ideal and comes with a number of pitfalls...
Generally speaking, you're better off not having any scraping code inside your app, because your app will take quite a while to change and redeploy to the store if the website changes and you need to update.
So, it's best to either have a server of your own do the scraping and then provide your 'sanitised' version of the data to the app, or to use a reconfigurable 3rd party service (like Kimono, I've never used it but the website is colourful) to abstract your app from the nitty gritty.
As for load, each copy of your app / service looks just like a normal user to the website, so the website needs to be able to handle that number of users in general.
I agree with the comment from @paulw11 about legality if you don't own / have a relationship with the website involved - you should have a relationship with them...

Get proper format of the text file in HTML form

I am using the following code to convert text to pdf form:
NSString *filePath = [[NSBundle mainBundle] pathForResource:@"All_lang_unicode" ofType:@"txt"];
NSString *str;
NSData *myData = [NSData dataWithContentsOfFile:filePath];
if (myData) {
str = [[NSString alloc] initWithData:myData encoding:NSUTF16StringEncoding];
NSLog(@"STRING : %@",str);
}
NSString *html = [NSString stringWithFormat:@"<body>%@</body>",str];
UIMarkupTextPrintFormatter *fmt = [[UIMarkupTextPrintFormatter alloc]
initWithMarkupText:html];
UIPrintPageRenderer *render = [[UIPrintPageRenderer alloc] init];
[render addPrintFormatter:fmt startingAtPageAtIndex:0];
CGRect page;
page.origin.x=0;
page.origin.y=0;
page.size.width=792;
page.size.height=612;
CGRect printable=CGRectInset( page, 0, 0 );
[render setValue:[NSValue valueWithCGRect:page] forKey:@"paperRect"];
[render setValue:[NSValue valueWithCGRect:printable] forKey:@"printableRect"];
NSLog(@"number of pages %ld", (long)[render numberOfPages]);
NSMutableData * pdfData = [NSMutableData data];
UIGraphicsBeginPDFContextToData( pdfData, CGRectZero, nil );
for (NSInteger i=0; i < [render numberOfPages]; i++)
{
UIGraphicsBeginPDFPage();
CGRect bounds = UIGraphicsGetPDFContextBounds();
[render drawPageAtIndex:i inRect:bounds];
}
UIGraphicsEndPDFContext();
NSArray *paths = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES);
NSString *documentsDirectory = [paths objectAtIndex:0];
NSString *pdfFile = [documentsDirectory stringByAppendingPathComponent:@"test.pdf"];
[pdfData writeToFile:pdfFile atomically:YES];
The problem is that I am not getting the proper formatting of the text. When I print the string using NSLog() I get the proper content, but when I place the string in the HTML, the spacing and newlines are missing and everything comes out on one continuous line.
(UPDATE : )
NSLog OUTPUT:(Proper)
NEW DELHI: Sachin Tendulkar's streak of low scores might have raised a question mark over his future but senior BCCI official and IPL chairman Rajiv Shukla on Monday came out in support of the senior batsman saying one needs to look at his "colossal record" before making any comment.
"He will hang up his boots when he thinks it's time for him to go. He does not need any advice on this. Before making a comment on his performance you have to see his colossal record and his past performance," Shukla told reporters outside the Parliament adding that the veteran cricketer will come back strongly in the forthcoming matches.
and what I am getting:
NEW DELHI: Sachin Tendulkar's streak of low scores might have raised a question mark over his future but senior BCCI official and IPL chairman Rajiv Shukla on Monday came out in support of the senior batsman saying one needs to look at his "colossal record" before making any comment. "He will hang up his boots when he thinks it's time for him to go. He does not need any advice on this. Before making a comment on his performance you have to see his colossal record and his past performance," Shukla told reporters outside the Parliament adding that the veteran cricketer will come back strongly in the forthcoming matches.
Can anyone please suggest a modification to this code so that I get the proper formatting?
If I get it right, you should replace your new line characters with <br> or <p>.
Try
str = [str stringByReplacingOccurrencesOfString:@"\n" withString:@"<br>"];
See also: How to detect new lines in Objective-C
A solution to your follow-up question might look like this:
NSArray *words = [str componentsSeparatedByString:@" "];
NSString *line = @"";
NSUInteger maxLineLength = 100;
NSString *resultStr = @"";
for (NSString *word in words) {
    if ([line length] + [word length] > maxLineLength) {
        // The current line is full: flush it and start a new one with this word.
        resultStr = [resultStr stringByAppendingFormat:@"%@<br>", line];
        line = word;
    } else if ([line length] == 0) {
        // First word on a line: no leading space.
        line = word;
    } else {
        line = [line stringByAppendingFormat:@" %@", word];
    }
}
resultStr = [resultStr stringByAppendingString:line];
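The newline replacement suggested in this answer can be wrapped in a small helper, sketched below. Two things here are assumptions rather than part of the original answer: the helper name HTMLBodyFromPlainText is hypothetical, and the extra steps (escaping &, < and >, and normalising \r\n and \r line endings) are added on the guess that a text file decoded from UTF-16 may contain them.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: escapes HTML special characters, converts line
// breaks to <br>, and wraps the result in <body> tags.
static NSString *HTMLBodyFromPlainText(NSString *text) {
    NSString *result = text;
    // Escape "&" first so the entities added below are not escaped twice.
    result = [result stringByReplacingOccurrencesOfString:@"&" withString:@"&amp;"];
    result = [result stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
    result = [result stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];
    // Normalise Windows (\r\n) and classic Mac (\r) line endings to \n.
    result = [result stringByReplacingOccurrencesOfString:@"\r\n" withString:@"\n"];
    result = [result stringByReplacingOccurrencesOfString:@"\r" withString:@"\n"];
    // Each remaining newline becomes an explicit HTML line break.
    result = [result stringByReplacingOccurrencesOfString:@"\n" withString:@"<br>"];
    return [NSString stringWithFormat:@"<body>%@</body>", result];
}
```

The returned string could then be passed to UIMarkupTextPrintFormatter's initWithMarkupText: in place of the html string built in the question.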

How can I parse tables in HTML?

I'm trying to parse an HTML page with a lot of tables. I searched the net for how to parse HTML with Objective-C and found hpple. I then looked for a tutorial, which led me to:
http://www.raywenderlich.com/14172/how-to-parse-html-on-ios
With this tutorial I tried to parse some forum news which has a lot of tables from this site (Hebrew): news forum
I tried to parse the news title, but I don't know what to write in my code. Every time I try to reach the path I get, "Nodes was nil."
The code of my latest attempt is:
NSURL *contributorsUrl = [NSURL URLWithString:@"http://rotter.net/cgi-bin/listforum.pl"];
NSData *contributorsHtmlData = [NSData dataWithContentsOfURL:contributorsUrl];
// 2
TFHpple *contributorsParser = [TFHpple hppleWithHTMLData:contributorsHtmlData];
// 3
NSString *contributorsXpathQueryString = @"//body/div/center/center/table[@cellspacing=0]/tbody/tr/td/table[@cellspacing=1]/tbody/tr[@bgcolor='#FDFDFD']/td[@align='right']/font[@class='text15bn']/font[@face='Arial']/a/b";
NSArray *contributorsNodes = [contributorsParser searchWithXPathQuery:contributorsXpathQueryString];
// 4
NSMutableArray *newContributors = [[NSMutableArray alloc] initWithCapacity:0];
for (TFHppleElement *element in contributorsNodes) {
// 5
Contributor *contributor = [[Contributor alloc] init];
[newContributors addObject:contributor];
// 6
Could somebody guide me through to getting the titles?
Not sure if that's an option for you, but if the desired tables have unique ids you could use a messy approach: load the HTML into a UIWebView and get the contents via -stringByEvaluatingJavaScriptFromString:, like this:
// desired table container's id is "msg"
NSString *value = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementById('msg').innerHTML"];

Getting the HTML tags in hpple as well as text?

The code below takes all of the text from a certain div. Is it possible to take the HTML tags from the div as well as the text, so that all of the <p></p> and <br> tags are also included in the string, mystring?
//trims string from previous page
NSString *trimmedString = [stringy stringByTrimmingCharactersInSet:
[NSCharacterSet whitespaceAndNewlineCharacterSet]];
NSData *data = [[NSString stringWithContentsOfURL:[NSURL URLWithString:trimmedString]] dataUsingEncoding:NSUTF8StringEncoding];
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];
NSArray *elements = [xpathParser searchWithXPathQuery:@"//div[@class='field-item even']"];
TFHppleElement *element = [elements lastObject]; //may need to change this number?!
NSString *mystring = [self getStringForTFHppleElement:element];
trimmedTextView.text = [trimmedTextView.text stringByAppendingString:mystring];
Method here:
-(NSString*) getStringForTFHppleElement:(TFHppleElement *)element
{
NSMutableString *result = [NSMutableString new];
// Iterate recursively through all children
for (TFHppleElement *child in [element children])
[result appendString:[self getStringForTFHppleElement:child]];
// Hpple creates a <text> node when it parses texts
if ([element.tagName isEqualToString:@"text"])
[result appendString:element.content];
return result;
}
Any ideas would be appreciated. Cheers.
Try this:
NSString *htmlDataString = [webView stringByEvaluatingJavaScriptFromString:@"document.documentElement.outerHTML"];
This returns the whole page's HTML as a string. You can then parse it in your native code and find the div you are interested in, as you did in the example above.
You can also do this for any single DOM element in your HTML, for example:
NSString *htmlDataString = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementById('mydiv').outerHTML"];
which is more efficient but requires a bit of JavaScript skill.

HTML from NSAttributedString

Rather than converting HTML to an attributed string, I need to convert it back to HTML. This can easily be done on Mac as can be seen here: http://www.justria.com/2011/01/18/how-to-convert-nsattributedstring-to-html-markup/
Unfortunately, the method dataFromRange:documentAttributes: is only available on Mac via the NSAttributedString AppKit Additions.
My question is how can you do this on iOS?
Not the 'easy' way, but what about iterating through the attributes of the string using:
- (void)enumerateAttributesInRange:(NSRange)enumerationRange
options:(NSAttributedStringEnumerationOptions)opts
usingBlock:(void (^)(NSDictionary *attrs, NSRange range, BOOL *stop))block
Have an NSMutableString variable to accumulate the HTML (let's call it 'html'). In the block, you would construct the HTML manually using strings. For instance, if the text attributes 'attrs' specify red, bold text:
[html appendFormat:@"<span style='color:red; font-weight: bold;'>%@</span>", [originalStr substringWithRange:range]];
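Put together, the enumeration approach might look roughly like the sketch below. It is deliberately minimal and rests on assumptions: the function names are hypothetical, only bold fonts and foreground colours are translated, every other attribute is ignored, and colours that cannot be expressed as RGB components are skipped.

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper: minimal HTML escaping for text content.
static NSString *EscapedHTMLText(NSString *text) {
    text = [text stringByReplacingOccurrencesOfString:@"&" withString:@"&amp;"];
    text = [text stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
    return [text stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];
}

// Hypothetical function: walks the attribute runs and emits one <span> per styled run.
static NSString *SimpleHTMLFromAttributedString(NSAttributedString *attributed) {
    NSMutableString *html = [NSMutableString string];
    NSString *plain = attributed.string;
    [attributed enumerateAttributesInRange:NSMakeRange(0, attributed.length)
                                   options:0
                                usingBlock:^(NSDictionary *attrs, NSRange range, BOOL *stop) {
        NSMutableArray *styles = [NSMutableArray array];

        // Bold is detected from the font descriptor's symbolic traits.
        UIFont *font = attrs[NSFontAttributeName];
        if (font != nil &&
            (font.fontDescriptor.symbolicTraits & UIFontDescriptorTraitBold) != 0) {
            [styles addObject:@"font-weight: bold"];
        }

        // Foreground colour is converted to a CSS rgb() value when possible.
        UIColor *color = attrs[NSForegroundColorAttributeName];
        CGFloat r = 0, g = 0, b = 0, a = 0;
        if (color != nil && [color getRed:&r green:&g blue:&b alpha:&a]) {
            [styles addObject:[NSString stringWithFormat:@"color: rgb(%d, %d, %d)",
                               (int)(r * 255 + 0.5), (int)(g * 255 + 0.5), (int)(b * 255 + 0.5)]];
        }

        NSString *text = EscapedHTMLText([plain substringWithRange:range]);
        if (styles.count > 0) {
            [html appendFormat:@"<span style='%@'>%@</span>",
             [styles componentsJoinedByString:@"; "], text];
        } else {
            [html appendString:text];
        }
    }];
    return html;
}
```

Mapping each attribute by hand keeps the output predictable, at the cost of having to add a case for every attribute you care about (italics, underline, font size, and so on).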
EDIT: Stumbled across this yesterday:
NSAttributedString+HTMLFromRange category from "UliKit"
(https://github.com/uliwitness/UliKit/blob/master/NSAttributedString+HTMLFromRange.m)
Looks like it will do what you want.
Use the code below; it works well (dataFromRange:documentAttributes:error: is available on iOS 7 and later).
NSAttributedString *s = ...;
NSDictionary *documentAttributes = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType};
NSData *htmlData = [s dataFromRange:NSMakeRange(0, s.length) documentAttributes:documentAttributes error:NULL];
NSString *htmlString = [[NSString alloc] initWithData:htmlData encoding:NSUTF8StringEncoding];