I want to grab text from a list of web pages. I've done a bit of experimenting and found that the best way for my needs is via WebKit.
Once the source of the page has been grabbed, I want to strip out all the HTML tags, by using the technique in this comment.
Here's my code:
- (void)webView:(WebView *)sender didFinishLoadForFrame:(WebFrame *)frame {
    if (frame == [sender mainFrame]) {
        NSError *theError = nil;
        // Raw HTML source of the page that has just finished loading
        NSString *content = [[[[sender mainFrame] dataSource] representation] documentSource];
        NSXMLDocument *theDocument = [[NSXMLDocument alloc] initWithXMLString:content options:NSXMLDocumentTidyHTML error:&theError];
        // Stylesheet that outputs text only and drops the contents of <head> and <script>
        NSString *theXSLTString = @"<?xml version='1.0' encoding='utf-8'?>\n<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml='http://www.w3.org/1999/xhtml'>\n<xsl:output method='text'/>\n<xsl:template match='xhtml:head'></xsl:template>\n<xsl:template match='xhtml:script'></xsl:template>\n</xsl:stylesheet>";
        NSData *theData = [theDocument objectByApplyingXSLTString:theXSLTString arguments:nil error:&theError];
        NSString *theString = [[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding];
    }
}
This works fine on most pages. However, if a page doesn't validate correctly as XHTML, I sometimes get an error from my initWithXMLString: method.
That's fair enough - I'm asking it to tidy up the XHTML, so I'd expect it to report what problems it's encountered. But if there's a problem with the validation, it returns nil and an error rather than actually tidying up the XHTML.
One specific page that's causing the problem is the Ruby class documentation.
I've found that the excellent third party HTML tidy application can clean up this XHTML fine, but I'd expect NSXMLDocumentTidyHTML to be able to just add some quotes around cellpadding values. It's a fairly basic cleanup operation. And I'm not keen to add another dependency into my code base.
Is there something I'm missing with the way Cocoa cleans up XHTML? Or do I just need to bite the bullet and use HTML Tidy instead in my code?
XHTML documents are treated as XML, so you may have better luck with the NSXMLDocumentTidyXML flag.
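A minimal sketch of that suggestion, assuming macOS and Foundation's NSXMLDocument. The helper name ParseLenient is hypothetical, and there is no guarantee that the second attempt succeeds for any particular page; it only shows how the two tidy flags could be tried in turn.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: try the HTML tidy option first, then fall back to
// the XML tidy option if the first attempt returns nil.
static NSXMLDocument *ParseLenient(NSString *source, NSError **outError) {
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:source
                                                          options:NSXMLDocumentTidyHTML
                                                            error:outError];
    if (doc == nil) {
        // Second attempt: treat the input as XML and let the parser tidy it.
        doc = [[NSXMLDocument alloc] initWithXMLString:source
                                               options:NSXMLDocumentTidyXML
                                                 error:outError];
    }
    // Callers should check the return value, not the error, for success.
    return doc;
}
```

In the delegate method from the question, the call to initWithXMLString:options:error: could be replaced by ParseLenient(content, &theError).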
I have an unusual requirement. My existing app has a Text2Speech feature, for which I used AVSpeechSynthesizer, but the requirement has changed and I now need to speak the contents of HTML files, something like HTML2Speech.
One solution we can think of:
parse the HTML, extract all the text from it, and use the same framework for Text2Speech.
But the client doesn't want that kind of parsing; he wants an API or framework that provides an HTML2Speech feature directly.
Any suggestion or help will be highly appreciated.
Having worked with both HTML parsing and text-to-speech, I'd suggest two steps:
1. Get an attributed string from the HTML file (the code below works on iOS 7+).
2. Pass that attributed string's text to AVSpeechUtterance.
From your client's perspective: any HTML2Speech API on the market may well be paid, and you would be dependent on it if you used one. The native frameworks give you and the client the same result.
Step 1:
[[NSAttributedString alloc] initWithData:[htmlString dataUsingEncoding:NSUTF8StringEncoding]
                                 options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                                           NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)}
                      documentAttributes:nil
                                   error:nil];
Then you can pass this attributed string's text to AVSpeechUtterance.
Step 2:
Use the method below to speak the text converted from HTML:
/**
 * "ConvertHTMLtoStrAndPlay" : This method takes the plain string from the
 * HTML-derived attributed string and plays it with the synthesizer.
 *
 * @param aStrWithHTMLAttributes : attributed string created from the HTML file
 */
-(void)ConvertHTMLtoStrAndPlay:(UIButton*)aBtnPlayPause
                isSpeechPaused:(BOOL)speechPaused
      stringWithHTMLAttributes:(NSAttributedString*)aStrWithHTMLAttributes
{
    if (synthesizer.speaking == NO && speechPaused == NO) {
        AVSpeechUtterance *utterance = [[AVSpeechUtterance alloc] initWithString:aStrWithHTMLAttributes.string];
        //utterance.rate = AVSpeechUtteranceMinimumSpeechRate;
        if (IS_ARABIC) {
            utterance.voice = [AVSpeechSynthesisVoice voiceWithLanguage:@"ar-au"];
        } else {
            utterance.voice = [AVSpeechSynthesisVoice voiceWithLanguage:@"en-au"];
        }
        [synthesizer speakUtterance:utterance];
    }
    else {
        [synthesizer pauseSpeakingAtBoundary:AVSpeechBoundaryImmediate];
    }
    if (speechPaused == NO) {
        [synthesizer continueSpeaking];
    } else {
        [synthesizer pauseSpeakingAtBoundary:AVSpeechBoundaryImmediate];
    }
}
As usual, when you need to stop the speech, use the code below.
/**
 * "StopPlayWithAVSpeechSynthesizer" : this method will stop the playing of audio on the application.
 */
-(void)StopPlayWithAVSpeechSynthesizer {
    // Stop immediately instead of waiting for the end of the current word.
    [synthesizer stopSpeakingAtBoundary:AVSpeechBoundaryImmediate];
}
Hope this helps you get the HTML2Speech feature.
There are two parts to a solution here...
Presumably you don't care about the formatting in the HTML--after all, by the time it gets to the speech synthesizer, this text is to be spoken, not viewed. AVSpeechSynthesizer takes plain text, so you just need to get rid of the HTML markup. One easy way to do that is to create an NSAttributedString from the HTML, then ask that attributed string for its underlying plain-text string to pass text to the synthesizer.
In iOS 10 you don't even have to extract the string from an attributed string — you can pass an attributed string directly to AVSpeechUtterance.
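The two parts could be sketched roughly as follows, assuming iOS with UIKit and AVFoundation. The function name UtteranceFromHTML is hypothetical. The HTML importer behind NSAttributedString is expected to run on the main thread, and the caller is assumed to keep its own AVSpeechSynthesizer alive while speaking.

```objc
#import <UIKit/UIKit.h>
#import <AVFoundation/AVFoundation.h>

// Hypothetical helper: turn an HTML string into an utterance.
// Call on the main thread, because the HTML importer requires it.
static AVSpeechUtterance *UtteranceFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                              NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
    NSError *error = nil;
    NSAttributedString *attributed = [[NSAttributedString alloc] initWithData:data
                                                                      options:options
                                                           documentAttributes:nil
                                                                        error:&error];
    if (attributed == nil) {
        NSLog(@"Could not convert HTML: %@", error);
        return nil;
    }
    if (@available(iOS 10.0, *)) {
        // iOS 10 and later accept the attributed string directly.
        return [AVSpeechUtterance speechUtteranceWithAttributedString:attributed];
    }
    // Earlier systems: hand over the underlying plain-text string.
    return [AVSpeechUtterance speechUtteranceWithString:attributed.string];
}
```

The returned utterance can then be passed to speakUtterance: on a synthesizer that the caller holds in a property or instance variable.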
One way or another, it will always come down to parsing the HTML into something else if you don't want the files read as-is. If the client wants a direct HTML2Speech solution, you can provide a method that takes an HTML file as an argument and reads it aloud. What happens to that file under the hood should not bother the client much, as long as it's clean and doesn't cause problems.
What happens when the client asks for Markdown2Speech or XML2Speech? From what I see in your description, it is better for now to have one framework with two public methods, Text2Speech and HTML2Speech, that take a link to a file or an NSString as an argument.
So, as @rickster suggests, it can be an NSAttributedString or an NSString. There are a lot of parsers out there, or, if you want your own solution, you can remove everything between < and > and change the encoding.
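A minimal sketch of the "remove everything between < and >" idea, using only Foundation. The function name StripTags is hypothetical. This is deliberately naive: it does not decode entities such as &amp;amp;, and it keeps the text inside script and style elements, so a real parser or the NSAttributedString route is likely the safer choice for arbitrary pages.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: delete every "<...>" run and trim the result.
static NSString *StripTags(NSString *html) {
    NSRegularExpression *tagPattern =
        [NSRegularExpression regularExpressionWithPattern:@"<[^>]*>"
                                                  options:0
                                                    error:NULL];
    NSString *text = [tagPattern stringByReplacingMatchesInString:html
                                                          options:0
                                                            range:NSMakeRange(0, html.length)
                                                     withTemplate:@""];
    return [text stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

For example, StripTags(@"<p>Hello <b>world</b></p>") yields the string "Hello world", which can be handed to the existing Text2Speech code.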
The safest method is to extract the text and use the existing text-to-speech API.
If you are sure that the browser will be Chrome, the Speech Synthesis API may be helpful. But this API is still not fully adopted by all browsers, so it would be a risky solution.
You can find necessary info regarding this API at
https://developers.google.com/web/updates/2014/01/Web-apps-that-talk-Introduction-to-the-Speech-Synthesis-API?hl=en
https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#examples-synthesis
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API
There is no direct API for HTML to speech except the Speech Synthesis API mentioned above. You can try http://responsivevoice.org/, but I think it is also based on the browser's Speech Synthesis or on speech generation on a server. So to use it, you would still have to extract the text and pass it to the API to get the speech.
My app parses an XML file and builds its own custom HTML from the contents of the article chosen in the XML. When I save an article, I have a class for the action, to which I pass the article title and the custom HTML as strings. The class takes those and saves them to the app using:
NSArray *paths = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES);
NSString *documentsDirectory = [paths objectAtIndex:0];
NSString *pdfPath = [documentsDirectory stringByAppendingPathComponent:[thetitle stringByAppendingString:@".html"]];
NSError *error = nil;
[thehtmlcontents writeToFile:pdfPath atomically:YES encoding:NSUTF8StringEncoding error:&error];
The issue that I have is that if I want to share a saved article via Facebook or Twitter, I can't, because the URL doesn't get saved with everything else. I can pass over the URL easy enough to the Save class, but I'm unsure of what to do with it, so that it stays associated with the article itself. Suggestions?
I'd say you broadly have three options:
1. Attach some metadata to the file noting the URL it corresponds to
2. Write out a file format that encapsulates the URL, plus the HTML
3. Include the URL in the HTML in a manner such that you can retrieve it
No. 1 would probably be best achieved by setting an extended attribute on the file. However, I'm not sure how well iOS supports this, and there may well be issues with it not being preserved in the event of something like restoring the OS.
Are you in a position to implement no. 3 reasonably cleanly? I would say a <meta> tag near the top of the document is best for doing this.
All that said, how important really is it that your HTML is stored in files? To me, this sounds like it could easily be chucked into a dirt simple Core Data database.
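The <meta> tag option could look something like the sketch below, which uses only Foundation. The meta name "source-url" and both function names are hypothetical. It assumes the app's custom HTML contains a literal <head> tag without attributes and that the URL contains no double quote characters.

```objc
#import <Foundation/Foundation.h>

// Hypothetical meta name used to tag the saved HTML with its source URL.
static NSString * const kSourceURLMetaName = @"source-url";

// Insert a <meta> tag right after <head>, or prepend it if there is no <head>.
static NSString *HTMLByEmbeddingSourceURL(NSString *html, NSString *url) {
    NSString *meta = [NSString stringWithFormat:@"<meta name=\"%@\" content=\"%@\">",
                      kSourceURLMetaName, url];
    NSRange head = [html rangeOfString:@"<head>" options:NSCaseInsensitiveSearch];
    if (head.location == NSNotFound) {
        return [meta stringByAppendingString:html];
    }
    NSRange insertionPoint = NSMakeRange(NSMaxRange(head), 0);
    return [html stringByReplacingCharactersInRange:insertionPoint withString:meta];
}

// Read the URL back out of saved HTML; returns nil if the tag is missing.
static NSString *SourceURLFromHTML(NSString *html) {
    NSString *pattern = [NSString stringWithFormat:
                         @"<meta name=\"%@\" content=\"([^\"]*)\">", kSourceURLMetaName];
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                           options:0
                                                                             error:NULL];
    NSTextCheckingResult *match = [regex firstMatchInString:html
                                                    options:0
                                                      range:NSMakeRange(0, html.length)];
    if (match == nil) {
        return nil;
    }
    return [html substringWithRange:[match rangeAtIndex:1]];
}
```

The Save class would call HTMLByEmbeddingSourceURL before writing the file, and the sharing code would call SourceURLFromHTML on the file's contents after reading it back.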
I'm trying to make a WebView load a page from HTML code I have stored as an NSData. I get a blank page when I try to do this. Is there anything wrong with what I'm doing when I load the page? If not, I need to look elsewhere in my program.
if (essence.html) { //essence.html is an NSData
    NSLog(@"Inserting HTML code into browser window: %@", [[NSString alloc] initWithData:essence.html encoding: NSUTF8StringEncoding]);
    [webView.mainFrame loadData:essence.html MIMEType: @"text/html" textEncodingName: @"utf-8" baseURL:nil]; //webView is a WebView
}
I created the conditions so essence.html contains HTML code from the page http://kathleenmelian.com/test.html (which just says "hello"). The NSLog prints this when the above code runs:
Inserting HTML code into browser window: <html><head></head><body>hello</body></html>
So essence.html definitely contains valid code that a browser should be able to load.
You could use
[webView.mainFrame loadHTMLString:
    [[NSString alloc] initWithData:essence.html encoding:NSUTF8StringEncoding]
    baseURL:nil];
The other idea would be to replace "utf-8" with "UTF-8", which in some cases is known to make a difference (not sure about UIWebView).
Sorry to bother you guys. I fixed some other bug in my program's model, and that somehow fixed THIS problem as well. I don't know how. Crisis averted.
I am trying to pull down the code from an HTML website that has no more than 2 lines on it. The code contains a word that I need to retrieve. Is there a simple way to pull down that code and put it in an NSString?
Further details: I am going to have an app that checks for a word on a page. If that word is what I am looking for, the app will show the text "confirmed". The purpose of the app is to check to see if the page is accessible.
If you need an HTTP library to hit the server, try ASIHTTPRequest. Apart from that, I need more information about what you are trying to do...
If you just want to check whether that website is reachable, you can check for an HTTP success status code.
Using ASIHTTPRequest simplifies communication over the web.
If you still want to evaluate the text on that website, you can also just retrieve it using:
[request responseString];
Depending on what you get from the website, it's up to you how to update the UI.
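As a rough sketch of both checks together, here is a version that uses Foundation's NSURLSession instead of ASIHTTPRequest (a swapped-in technique, not what the answers above use). The function names PageIsConfirmed and CheckPage are hypothetical, and the word to look for is assumed to be non-nil.

```objc
#import <Foundation/Foundation.h>

// Hypothetical pure check: a 2xx status and the word present in the body.
static BOOL PageIsConfirmed(NSInteger status, NSString *body, NSString *word) {
    if (status < 200 || status >= 300 || body == nil) {
        return NO;
    }
    return [body rangeOfString:word].location != NSNotFound;
}

// Hypothetical wrapper: fetch the page asynchronously and report the result.
// The completion block runs on a background queue, so hop to the main queue
// before touching the UI.
static void CheckPage(NSURL *url, NSString *word, void (^completion)(BOOL confirmed)) {
    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data,
                                                        NSURLResponse *response,
                                                        NSError *error) {
            BOOL confirmed = NO;
            if (error == nil && data != nil &&
                [response isKindOfClass:[NSHTTPURLResponse class]]) {
                NSInteger status = [(NSHTTPURLResponse *)response statusCode];
                NSString *body = [[NSString alloc] initWithData:data
                                                       encoding:NSUTF8StringEncoding];
                confirmed = PageIsConfirmed(status, body, word);
            }
            completion(confirmed);
        }];
    [task resume];
}
```

The completion block is where the app would set its label to "confirmed" when the flag is YES.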
Just change the link between the quotes and it'll work!
-(void) viewDidLoad {
    [super viewDidLoad];
    // RSS feed URL goes between the quotes
    NSString *sFeedURL = @"http://www.google.com/ig/api?weather=,,,270000,960000";
    // Fetch the page synchronously and decode it as ASCII
    NSString *sActualFeed = [NSString stringWithContentsOfURL:[NSURL URLWithString:sFeedURL] encoding:NSASCIIStringEncoding error:nil];
    NSLog(@"%@", sActualFeed);
}
I'm trying to display HTML source code in my NSDocument based application. However, it renders the page as Safari would show it.
Here's the code that I use to open HTML:
NSData *data;
NSDictionary *dict = [NSDictionary dictionaryWithObject:NSHTMLTextDocumentType
                                                 forKey:NSDocumentTypeDocumentOption];
data = [NSData dataWithContentsOfFile:[self fileName]];
mString = [[NSAttributedString alloc] initWithData:data
                                           options:dict
                                documentAttributes:NULL
                                             error:outError];
What am I doing wrong?
The correct solution is a mix of your original code and the bogus solution I gave you in my previous answer (which I've deleted). Use NSPlainTextDocumentType as the type, as you were doing originally, but use initWithData:options:documentAttributes:error:, not initWithHTML:options:documentAttributes:.
Alternatively, create a plain NSString holding the source code, and then create an attributed string with that plain string plus whatever attributes you want to apply to the whole document (e.g., fixed-pitch font).
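The alternative could be sketched like this, assuming macOS with AppKit and a UTF-8 encoded file. The function name SourceAttributedString and the 12-point size are illustrative choices, not part of the original code.

```objc
#import <Cocoa/Cocoa.h>

// Hypothetical helper: load the file as plain text so the markup is shown
// verbatim, then apply one fixed-pitch font to the whole string.
static NSAttributedString *SourceAttributedString(NSString *path, NSError **outError) {
    NSString *source = [NSString stringWithContentsOfFile:path
                                                 encoding:NSUTF8StringEncoding
                                                    error:outError];
    if (source == nil) {
        return nil;
    }
    NSFont *font = [NSFont userFixedPitchFontOfSize:12.0];
    // Fall back to no attributes if no fixed-pitch font is available.
    NSDictionary *attributes = font ? @{NSFontAttributeName: font} : @{};
    return [[NSAttributedString alloc] initWithString:source attributes:attributes];
}
```

In the document class, mString could then be assigned from SourceAttributedString([self fileName], outError).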