Scraping HTML with Swift 4 in Xcode 9

I have a website that I want to scrape for specific links.
I already used URLSession to load the website's contents into a string.
Now I need to collect into an array all links that have the following structure:
<a href="/thisIsAlwaysTheSame/ThisIsAUniqueNumber/ThisIsWhatIDontNeed..."
so that I get an array like: [href="/thisIsAlwaysTheSame/UniqueNumberA/", href="/thisIsAlwaysTheSame/UniqueNumberB/", href="/thisIsAlwaysTheSame/UniqueNumberC/", etc.]
There are many more links on the website, but I only need the ones in this format.
Alternatively, I would also be happy to get only the unique numbers into an array.
I already asked this question on reddit, but didn't get sufficient answers:
https://www.reddit.com/r/swift/comments/7256vi/scraping_an_html_with_swift_4_in_xcode_9/
Here is what I know so far from my own research and the answers on reddit:
- "Kanna" is suggested, but I can't get it running in Xcode 9 (I already opened an issue on GitHub)
- SwiftSoup could be an option, but I have the same problem as with Kanna: I can't get it running in Xcode 9 (I also opened an issue on GitHub)
- I was advised that I can solve my problem with the Swift String type by reading the following link: https://developer.apple.com/documentation/swift/string. I read it, but I don't really see how I can solve my problem with these methods. Perhaps I am missing something there?
Any advice? Thanks for your help!

I used the following code after adding SwiftSoup:
guard let linkElements: Elements = try? SwiftSoup.parse(myLinkHTMLContent).select("a") else { return }
// Append each link element to the array as a string
for element: Element in linkElements.array() {
    myLinksArray.append("\(element)")
}

If I understand correctly, you want to extract all URLs from an HTML string. You can do so by running a link detector over the string and looping over its matches:
let detector = try! NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue)
let matches = detector.matches(in: content, options: [], range: NSRange(location: 0, length: content.utf16.count))
for match in matches {
    let url = (content as NSString).substring(with: match.range)
    if url.contains("ThisIsWhatIDontNeed") {
        // skip or handle the links you don't need
    } else {
        self.img_urls.append(url)
    }
}

Related

Swiftsoup parsing is not finding all HTML classes

I have a method that parses a website using SwiftSoup to get the price of a product:
@objc func actionButtonTapped() {
    let url = "https://www.overkillshop.com/de/c2h4-interstellar-liaison-panelled-zip-up-windbreaker-r001-b012-vanward-black-grey.html"
    let url2 = "https://www.asos.com/de/asos-design/asos-design-schwarzer-backpack-mit-ringdetail-und-kroko-muster/prd/14253083?clr=schwarz&colourWayId=16603012&SearchQuery=&cid=4877"
    do {
        let html: String = getHTMLfromURL(url: url2)
        let doc: Document = try SwiftSoup.parse(html)
        let priceClasses: Elements = try doc.select("[class~=(?i)price]")
        for priceClass: Element in priceClasses.array() {
            let priceText: String = try priceClass.text()
            print(try priceClass.className())
            print("pricetext: \(priceText)")
        }
    } catch Exception.Error(let type, let message) {
        print(message)
    } catch {
        print("error")
    }
}
The method works fine for url, but for url2 it does not print all the class names, even though they match the regex. This is where the price actually is:
<span data-id="current-price" data-bind="text: priceText(), css: {'product-price-discounted' : isDiscountedPrice }, markAndMeasure: 'pdp:price_displayed'" class="current-price">36,99 €</span>
The output of the function is this:
product-price
pricetext:
stock-price-retry-oos
pricetext:
stock-price-retry
pricetext:
It is not printing class=current-price. Is something wrong with my regex, or why does it not find that class?
EDIT:
I found out that the price is not actually in the HTML I get for url2. Only the classes that are printed are in it. What is the reason for that, and how can I solve it?
The HTML is not static; it can change over time. If you make a GET request to the site's URL, you only get the initial HTML for that page.
But browsers run JavaScript, which can change the page's HTML after it has loaded. This is quite common:
- The site is first loaded with some JavaScript
- The JavaScript (written by the site's developer) then runs
- That JavaScript calls some API and changes the content dynamically
You can't get that content by scraping the HTML of the base URL.
If you ask me how I'd do it anyway: I'd look for the HTTP requests the site makes to fetch its content, look at that API, and call it myself. I'd fetch the data and store it on one of my own servers.
Then the client would call my server's API to get that data.
I'm not really sure that's legal, though.
But as far as I understood from your last couple of questions, you don't want to do that.
If you really need to do that on the client, you can use WKWebView, load the page, wait for the content to show up, and then get the current HTML of the page by doing something like this:
webView.evaluateJavaScript("document.documentElement.outerHTML.toString()",
                           completionHandler: { (html: Any?, error: Error?) in
    print(html)
})
Look at this answer for more about this.
I hope this solves all of your problem, because I think I don't have much more time to help you :D

Angular: Routing between pages using condition

I am trying to route between pages using a basic if condition in Angular.
GoToHome() {
    if (this.router.url == '/chat') {
        console.log(this.router.url)
        this.router.navigate(['login']);
    } else {
        this.router.navigate(['people']);
    }
}
The problem is that the route chat isn't really correct: there are many pages under chat (chat/x, chat/y and many others). I want it to work for all the pages under chat, but right now it doesn't. If I write a specific route like chat/x it does work, but only for x. Is there a way to do it for all of them?
You can read up on guards. Have a look at the CanActivate method; maybe it will help you.
Route guards might do a better job of handling the redirects for your requirement.
But a quick workaround would be to call split() on the URL and compare the first path segment with chat. Try the following:
if(((this.router.url).split('/')[1]) === 'chat') {
// proceed
}
As others have said, the best solution is to use an Angular guard: https://medium.com/@ryanchenkie_40935/angular-authentication-using-route-guards-bf7a4ca13ae3
Anyway, to solve your problem you can use the startsWith() function, which determines whether a string begins with the characters of a specified string.
GoToHome() {
    if (this.router.url.startsWith('/chat')) {
        console.log(this.router.url)
        this.router.navigate(['login']);
    } else {
        this.router.navigate(['people']);
    }
}
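One caveat with a plain startsWith('/chat') check is that it would also match an unrelated route such as /chatroom. A minimal sketch of a stricter check follows; isUnderRoute is a hypothetical helper name, not part of the Angular Router, and it only assumes that the URL is a string such as the one returned by this.router.url.

```typescript
// Hypothetical helper (not an Angular API): true when `url` is `base`
// itself or any page beneath it, e.g. '/chat', '/chat/x', '/chat/y'.
function isUnderRoute(url: string, base: string): boolean {
  // Drop the query string and fragment so '/chat/x?tab=1' still matches.
  const path = url.split(/[?#]/)[0] ?? '';
  return path === base || path.startsWith(base + '/');
}

console.log(isUnderRoute('/chat', '/chat'));         // true
console.log(isUnderRoute('/chat/x?tab=1', '/chat')); // true
console.log(isUnderRoute('/chatroom', '/chat'));     // false
```

Inside the component, the condition would then read isUnderRoute(this.router.url, '/chat').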

Angular 5 httpClient get Request vscode json error

My code is working perfectly, I just have a problem where my data from the get request is underlined in red and I don't understand how to fix this. The alert works perfectly. It alerts the SessionID that I need. I would just like to know how I can remove this underlined error, or maybe I am not doing the get request correctly? Thank you for any help :)
Try the code below:
this.httpClient.get('hidden').subscribe((myData: any) => {
    alert(myData.utLogon_response.sessionId);
});
Add : any to the parameter.
Thanks,
The red wiggly lines indicate a type error: the compiler cannot find utLogon_response in the type of the myData variable. This can be solved in either of the following ways:
1. Give myData the type any (simple, quick and easy)
this.httpClient.get('hidden')
    .subscribe((myData: any) => {
        alert(myData.utLogon_response.sessionId);
    })
However, my understanding is that one should use any only in extreme conditions, since it defeats the very purpose of TypeScript.
2. Define a proper type for myData and use it (simple, a little more code, but the most appropriate)
interface IHiddenData {
    utLogon_response: {
        sessionId: string;
    };
}
this.httpClient.get<IHiddenData>('hidden')
    .subscribe((myData: IHiddenData) => {
        alert(myData.utLogon_response.sessionId);
    })
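An interface only describes the expected shape at compile time; it does not check what the server actually sends. As a rough sketch of a runtime check, the snippet below uses a type guard. LogonResponse and isLogonResponse are hypothetical names, the sample JSON is made up for illustration, and only the two fields mentioned in this thread are assumed.

```typescript
// Hypothetical shape of the response; only the fields used in this thread.
interface LogonResponse {
  utLogon_response: { sessionId: string };
}

// Type guard: narrows an unknown value to LogonResponse after checking
// that the nested object and the string field are really there.
function isLogonResponse(data: unknown): data is LogonResponse {
  if (typeof data !== 'object' || data === null) return false;
  const inner = (data as Record<string, unknown>)['utLogon_response'];
  if (typeof inner !== 'object' || inner === null) return false;
  return typeof (inner as Record<string, unknown>)['sessionId'] === 'string';
}

// Made-up sample payload standing in for the real response body.
const raw: unknown = JSON.parse('{"utLogon_response":{"sessionId":"abc123"}}');
if (isLogonResponse(raw)) {
  console.log(raw.utLogon_response.sessionId); // prints "abc123"
}
```

With such a guard, the request could be typed as unknown and narrowed inside subscribe, so neither any nor an unchecked cast is needed.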
I hope this helps!
NOTE: Somehow, this code is not getting formatted properly by the editor and I do not know how to set it right.

Embedded Tweet not displaying properly in UIWebView swift

I am getting HTML in one key of the JSON response from an API call. I load that HTML in a UIWebView.
Everything is displayed, but for content with the "twitter-tweet" class the view is not rendered as a tweet; only the text is shown.
This is how I want it to be displayed on the simulator:
This is how it is being displayed:
The HTML comes in like this:
This way worked for me; please go step by step.
A sample Twitter response HTML looks like this (remove the \ before each / in the text to make it a valid Swift string):
let responseTwitterHTMLContent = " <blockquote class=\"twitter-tweet\" data-width=\"500\">\n<p lang=\"en\" dir=\"ltr\"><a href=\"https:\/\/twitter.com\/kalyansury\">@kalyansury<\/a> Yes, last night. With a really lame and bleeding obvious response. I'm thinking of my next steps. <a href=\"https:\/\/twitter.com\/HDFC_Bank\">@HDFC_Bank<\/a> <a href=\"https:\/\/twitter.com\/HDFCBank_Cares\">@HDFCBank_Cares<\/a><\/p>\n<p>— Karthik (@beastoftraal) <a href=\"https:\/\/twitter.com\/beastoftraal\/status\/826589530748813313\">February 1, 2017<\/a><\/p><\/blockquote> "
If your response doesn't include this Twitter script:
<script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>
then add the Twitter script to the head tag programmatically, like below:
<head><script async src=//platform.twitter.com/widgets.js charset=utf-8></script></head>
The final modified HTML content then looks like this:
let ModifyHtmlcontent = "<html><head><script async src=\"//platform.twitter.com/widgets.js\" charset=\"utf-8\"></script></head><body style=\"width: 100%; height: auto; overflow: hidden; margin: 100px; text-align: justify;\">\(responseTwitterHTMLContent)</body></html>"
Load this content in the web view.
If you load the HTML above as is, the tweet is not rendered in Twitter's format, because UIWebView does not download the Twitter script included above. To fix that, you need to give it the base URL https:
let url: URL = URL(string: "https:")!
self.yourWebView.loadHTMLString(ModifyHtmlcontent, baseURL: url)
If the base URL causes problems, add https: directly to the src of the Twitter script and set the base URL to nil:
<head><script async src=https://platform.twitter.com/widgets.js charset=utf-8></script></head>
self.yourWebView.loadHTMLString(ModifyHtmlcontent, baseURL: nil)
Make sure the script tag has the async attribute.
Happy Coding!

Display JSON object nicely with Syntax Highlighter

I'm trying to display a JSON object nicely (that is, on several lines with indentation) with Alex Gorbatchev's plugin: http://alexgorbatchev.com/SyntaxHighlighter/
Unfortunately, it is all displayed on a single line.
I'm using the javascript brush.
I've created a CodePen: http://codepen.io/hugsbrugs/pen/XJVjjP?editors=101
var json_object = {"hello":{"my_friend":"gérard", "my_dog":"billy"}};
$('#nice-json').html('<pre class="brush: javascript">' + JSON.stringify(json_object) + '</pre>');
SyntaxHighlighter.highlight();
Please don't give a list of other plugins. I know there are plenty, but I don't want to load additional plugins; I'd like to achieve this with this plugin.
Thanks for your help
Try indenting the json with the stringify method.
JSON.stringify(json_object, undefined, 2);
You can use the optional third parameter of JSON.stringify(...) which is the space argument.
Change:
JSON.stringify(json_object)
to:
JSON.stringify(json_object, null, '\t')
Here is your codepen updated to show the result of the above modifications. The above modification causes your JSON to be pretty printed over multiple lines.
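To make the effect of the space argument concrete, here is a small sketch that compares the three calls mentioned above on the object from the question. The variable names are illustrative only; JSON.stringify and its third parameter are standard.

```typescript
const jsonObject = { hello: { my_friend: 'gérard', my_dog: 'billy' } };

// No space argument: everything ends up on a single line.
const compact = JSON.stringify(jsonObject);

// Numeric space argument: each nesting level is indented by two spaces.
const pretty = JSON.stringify(jsonObject, null, 2);

// String space argument: each nesting level is indented by one tab.
const tabbed = JSON.stringify(jsonObject, null, '\t');

console.log(compact);
console.log(pretty);
console.log(tabbed);
```

Because the pretty-printed string contains real newline characters, it keeps its line breaks when placed inside a pre element, which is what the highlighter needs.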