SwiftSoup parsing is not finding all HTML classes

I have a method that parses a website with SwiftSoup to get the price of a product:
@objc func actionButtonTapped() {
    let url = "https://www.overkillshop.com/de/c2h4-interstellar-liaison-panelled-zip-up-windbreaker-r001-b012-vanward-black-grey.html"
    let url2 = "https://www.asos.com/de/asos-design/asos-design-schwarzer-backpack-mit-ringdetail-und-kroko-muster/prd/14253083?clr=schwarz&colourWayId=16603012&SearchQuery=&cid=4877"
    do {
        let html: String = getHTMLfromURL(url: url2)
        let doc: Document = try SwiftSoup.parse(html)
        // Select every element whose class attribute contains "price" (case-insensitive)
        let priceClasses: Elements = try doc.select("[class~=(?i)price]")
        for priceClass: Element in priceClasses.array() {
            let priceText: String = try priceClass.text()
            print(try priceClass.className())
            print("pricetext: \(priceText)")
        }
    } catch Exception.Error(_, let message) {
        print(message)
    } catch {
        print("error")
    }
}
The method works fine for url, but for url2 it does not print all the class names even though they match the regex. This is where the price actually is:
<span data-id="current-price" data-bind="text: priceText(), css: {'product-price-discounted' : isDiscountedPrice }, markAndMeasure: 'pdp:price_displayed'" class="current-price">36,99 €</span>
The output of the function is this:
product-price
pricetext:
stock-price-retry-oos
pricetext:
stock-price-retry
pricetext:
It is not printing class=current-price. Is something wrong with my regex, or why does it not find that class?
EDIT:
I found out that the price is not actually inside the HTML of url2. Only the classes that are actually printed out are inside. What's the reason for that and how can I solve that?

The HTML is not static; it can change over time. A GET request to the site's URL only gives you the initial HTML of the page.
Browsers, however, also run JavaScript, which can change the page's HTML after it has loaded. This is quite common:
- The site is first loaded together with some JavaScript
- The JavaScript (written by the site's developer) then runs
- That JavaScript calls some API and changes the content dynamically
You can't get that content by scraping the HTML of the base URL.
If you ask me how I'd do it anyway: I'd look for the HTTP requests the site makes to fetch the content, look at that API, and call it myself. I'd get the data and store it on one of my own servers.
Then the client calls my server's API to get that data.
Also, I'm not really sure that's legal.
But as far as I understood from your last couple of questions, you don't want to do that.
If you really need to do that on the client, you can use WKWebView: load the page, wait for the content to show up, and then get the current HTML of the page with something like this:
webView.evaluateJavaScript("document.documentElement.outerHTML.toString()",
                           completionHandler: { (html: Any?, error: Error?) in
    // On success, html holds the rendered page source as a String
    print(html ?? "no HTML returned")
})
Look at this answer for more about this.
I hope this solves all of your problem, because I think I don't have much more time to help you :D

Related

Angular 5 httpClient get Request vscode json error

My code is working perfectly, I just have a problem where my data from the get request is underlined in red and I don't understand how to fix this. The alert works perfectly. It alerts the SessionID that I need. I would just like to know how I can remove this underlined error, or maybe I am not doing the get request correctly? Thank you for any help :)
Try the code below:
this.httpClient.get('hidden').subscribe((myData: any) => {
    alert(myData.utLogon_response.sessionId);
});
Adding ": any" to the callback parameter removes the error.
The red squiggly lines indicate a type error: the compiler cannot find utLogon_response in the type definition of the myData variable. This can be solved in either of the following ways:
Make myData of type any (simple, quick and easy)
this.httpClient.get('hidden')
    .subscribe((myData: any) => {
        alert(myData.utLogon_response.sessionId);
    });
However, my understanding is that one should use any only in extreme conditions, since it defeats the very purpose of TypeScript.
Define a proper type for myData and use it (simple, slightly more code, but most appropriate)
interface IHiddenData {
    utLogon_response: {
        sessionId: string;
    };
}
// The type parameter tells HttpClient what shape the response body has
this.httpClient.get<IHiddenData>('hidden')
    .subscribe((myData: IHiddenData) => {
        alert(myData.utLogon_response.sessionId);
    });
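An interface only helps at compile time; if the server ever sends something else, the access still fails at runtime. A small runtime check can complement it. This is only a sketch, assuming the response really has the shape { utLogon_response: { sessionId: string } }; the names LogonResponse, isLogonResponse and sessionIdOf are made up for illustration and are not part of Angular:

```typescript
// Hypothetical shape of the response; adjust it to what the server really sends.
interface LogonResponse {
  utLogon_response: {
    sessionId: string;
  };
}

// Runtime type guard: narrows an unknown value to LogonResponse.
function isLogonResponse(value: unknown): value is LogonResponse {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const inner = (value as { utLogon_response?: unknown }).utLogon_response;
  return (
    typeof inner === "object" &&
    inner !== null &&
    typeof (inner as { sessionId?: unknown }).sessionId === "string"
  );
}

// Returns the session id, or undefined if the payload has a different shape.
function sessionIdOf(payload: unknown): string | undefined {
  return isLogonResponse(payload)
    ? payload.utLogon_response.sessionId
    : undefined;
}
```

Inside the subscribe callback you could then call sessionIdOf(myData) and only alert when the result is not undefined.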
I hope this helps!

JavaFX WebView: link to anchor in document doesn't work using loadContent()

Note: This is about JavaFX WebView, not Android WebView (i. e. I have seen "Android Webview Anchor Link (Jump link) not working").
I display a generated HTML page inside a javafx.scene.web.WebView that contains anchors and links to those anchors like this:
<p>Jump to <a href="#introduction">Introduction</a></p>
some text ...
<h1 id="introduction">Introduction</h1>
more text ...
I use this code to load the HTML into the WebView:
public void go(String location) {
try {
// read the content into a String ...
String html = NetUtil.readContent(new URL(location), StandardCharsets.UTF_8);
// ... and use loadContent()
webview.getEngine().loadContent(html);
} catch (IOException e) {
LOG.error(e);
}
}
Everything is rendered correctly, but if I click on the link named "Introduction", nothing happens.
The HTML however is correct, which I checked by instead using this code:
public void go(String location) {
// use load() to directly load the URL
webview.getEngine().load(location);
}
Now, everything works fine.
The problem seems to be that the document URL of the WebView is null when using loadContent(), but since it's a read-only property, I have no idea how to make it work.
I need to use loadContent(), because the HTML is generated on the fly, and if possible in any way, I don't want to have to write it out to a file just to make anchor links working. Is there a way to fix this?
EDIT
I filed a bug for JavaFX.
It's probably another WebEngine bug. A lot of that code is just native libraries wrapped in an API, so we can't modify it at runtime to work around its shortcomings.
If you are able to change the structure of the generated file, you can implement scrolling to an element in JS:
<script>
// Named jumpTo rather than scrollTo so that it cannot clash with the built-in scrollTo
function jumpTo(elementId) {
    document.getElementById(elementId).scrollIntoView();
}
</script>
<a href="#" onclick="jumpTo('CX'); return false;">Jump to Chapter X</a>
<h2 id="CX">Chapter X</h2>
If you can't change the structure, here are some steps I took to try to fix it, plus some suggestions. First, to be sure, I set the value of location via reflection after loadContent():
Field locationField = WebEngine.class.getDeclaredField("location");
locationField.setAccessible(true);
ReadOnlyStringWrapper location = (ReadOnlyStringWrapper) locationField.get(engine);
location.set("local");
But in fact, the stored location is purely informational, and manipulating it changes nothing. I also found a way to set the URL from JS (just a long shot, since we don't have any specific details on why it's not working):
window.history.pushState("generated", "generated", '/generated');
Of course that doesn't work either, because of:
SecurityError: DOM Exception 18: An attempt was made to break through the security policy of the user agent.
I think you should forget about loadContent(). You said that you didn't want to write the generated content to a file. A slightly dirty but really helpful hack would be to embed an HTTP server in your application, listening on a random unused port. You don't even need external libraries, because Java ships with simple utilities for that:
HttpServer server = HttpServer.create(new InetSocketAddress(25000), 0);
server.createContext("/generated", httpExchange -> {
    // Send the byte count, not the character count, so non-ASCII content is not truncated
    byte[] content = getContent().getBytes(StandardCharsets.UTF_8);
    httpExchange.sendResponseHeaders(200, content.length);
    try (OutputStream os = httpExchange.getResponseBody()) {
        os.write(content);
    }
});
server.setExecutor(null); // null selects the default executor
server.start();
You can also use another browser to display your page, e.g. JCEF (Java Chromium Embedded Framework).

Scraping an html with Swift 4 in Xcode 9

Ok I have a website I want to scrape for specific links.
I already used URLSession to put all websites contents into a string.
Now I have to get all Links into an array which have the following structure:
"< a href="/thisIsAlwaysTheSame/ThisIsAUniqueNumber/ThisIsWhatIDontNeed..."
So that I get an array: [href="/thisIsAlwaysTheSame/UniqueNumberA/, href="/thisIsAlwaysTheSame/UniqueNumberB, href="/thisIsAlwaysTheSame/UniqueNumberC, etc.]"
There are many more links on the website, but I only need those which have this format.
Optionally I would also be happy if I get only the UniqueNumbers into an array.
I already asked this question on reddit, but didn't get sufficient answers:
https://www.reddit.com/r/swift/comments/7256vi/scraping_an_html_with_swift_4_in_xcode_9/
Here is what I already know from my research and the answers on reddit:
- "Kanna" is suggested, but I can't get it running in Xcode 9 (I already opened an issue on GitHub)
- SwiftSoup could be an option, but I have the same problem as with Kanna: I can't get it running in Xcode 9 (I also opened an issue on GitHub)
- I got the advice that I can solve my problem with the Swift String class by reading the following link: https://developer.apple.com/documentation/swift/string but after reading it I don't really see how I can solve my problem with these methods. Perhaps I am missing something there?
Any advice? Thanks for your help!
I used the following code after adding SwiftSoup:
guard let linkElements: Elements = try? SwiftSoup.parse(myLinkHTMLContent).select("a") else { return }
// Append the HTML of every link element to the array
for element: Element in linkElements.array() {
    myLinksArray.append("\(element)")
}
If I understand correctly, you want to extract all URLs from an HTML string. You can do so with a loop that checks the string for URLs:
let detector = try! NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue)
let matches = detector.matches(in: content, options: [], range: NSRange(location: 0, length: content.utf16.count))
for match in matches {
    let url = (content as NSString).substring(with: match.range)
    if url.contains("ThisIsWhatIDontNeed") {
        // do something
    } else {
        self.img_urls.append(url)
    }
}

Embedded Tweet not displaying properly in UIWebView swift

I am getting HTML in a key of the JSON response from an API call, and I load that HTML in a UIWebView.
Everything is displayed, but the content with the "twitter-tweet" class is not rendered as a tweet; only its text is shown.
This is how I want it to be displayed on the simulator, but
this is how it is being displayed.
The HTML comes in like this
This way worked for me; please go step by step.
The sample Twitter response HTML looks like this (remove the "\/" JSON escapes to make it a valid Swift string):
let responseTwitterHTMLContent = " <blockquote class=\"twitter-tweet\" data-width=\"500\">\n<p lang=\"en\" dir=\"ltr\"><a href=\"https:\/\/twitter.com\/kalyansury\">@kalyansury<\/a> Yes, last night. With a really lame and bleeding obvious response. I'm thinking of my next steps. <a href=\"https:\/\/twitter.com\/HDFC_Bank\">@HDFC_Bank<\/a> <a href=\"https:\/\/twitter.com\/HDFCBank_Cares\">@HDFCBank_Cares<\/a><\/p>\n<p>— Karthik (@beastoftraal) <a href=\"https:\/\/twitter.com\/beastoftraal\/status\/826589530748813313\">February 1, 2017<\/a><\/p><\/blockquote> "
If your response doesn't include this twitter script
<script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>
Include the Twitter script in the head tag programmatically, like below
<head><script async src=//platform.twitter.com/widgets.js charset=utf-8></script></head>
Then the final modified HTML content looks like
let ModifyHtmlcontent = "<html><head><script async src=//platform.twitter.com/widgets.js charset=utf-8></script></head><body style='width: 100%; height: auto; overflow: hidden; margin: 100px; text-align: justify;'>\(responseTwitterHTMLContent)</body></html>"
Load this content in the web view.
If I load the above HTML as it is, it doesn't show the Twitter formatting, because UIWebView does not download the Twitter script we included above. For that, you need to give the base URL https:
let url: URL = URL(string: "https:")!
self.yourWebView.loadHTMLString(ModifyHtmlcontent, baseURL: url)
If the base URL causes problems, add https: directly to the Twitter script URL and set the base URL to nil
<head><script async src=https://platform.twitter.com/widgets.js charset=utf-8></script></head>
self.yourWebView.loadHTMLString(ModifyHtmlcontent, baseURL: nil)
Make sure the script tag has the async attribute.
Happy Coding!

Extracting data from API to html elements

I am not an experienced coder so excuse me if my explanation isn't perfect.
I'm making an HTML page and I'd like there to be a section that shows some Osu! stats. There's this osu API that spits out all of the information I could possibly need, but there's a little bit too much of it.
https://osu.ppy.sh/api/get_user?k=ff96ad02d159e0acad3282ad33e43a710dac96d5&u=Sceleri
The above returns:
[{"user_id":"6962718","username":"Sceleri","count300":"93129","count100":"15744","count50":"3404","playcount":"776","ranked_score":"184300015","total_score":"258886799","pp_rank":"345687","level":"34.115","pp_raw":"314.239","accuracy":"94.54791259765625","count_rank_ss":"1","count_rank_s":"55","count_rank_a":"74","country":"FI","pp_country_rank":"4112","events":[]}]
I'd like to parse a few numbers from there. Example:
"pp_raw":"314.239" -> <p>314.239</p>
The <p> would be inside a div and so on, where I can specify some CSS to it and make it look good. The main problem is extracting the data to separate <p> elements.
I have executed this with regex in Rainmeter before (I had help) but I have no idea how to do it in html.
Use jQuery AJAX calls. The URL you posted basically gives you JSON (an array containing one object).
HTML:
<div id="pp_raw">
</div>
jQuery:
$.get("https://osu.ppy.sh/api/get_user?k=ff96ad02d159e0acad3282ad33e43a710dac96d5&u=Sceleri", function (data) {
    // You can put whatever you want in the style attr to make things pretty
    $("#pp_raw").html("<p style='color:red'>" + data[0]['pp_raw'] + "</p>");
});
JSFiddle:
https://jsfiddle.net/rwt5mdyk/8/
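The extraction step itself does not depend on jQuery. As a rough sketch in TypeScript, assuming the response body is the JSON array shown in the question (every value is a string), the field can be pulled out and wrapped in a <p> like this; OsuUser, parseFirstUser and toParagraph are hypothetical names chosen for this example:

```typescript
// Subset of the fields returned by the API; every value arrives as a string.
interface OsuUser {
  username: string;
  pp_raw: string;
  pp_rank: string;
  accuracy: string;
}

// Parses the raw response body (a JSON array) and returns its first user.
function parseFirstUser(body: string): OsuUser {
  const users: unknown = JSON.parse(body);
  if (!Array.isArray(users) || users.length === 0) {
    throw new Error("API returned no user");
  }
  return users[0] as OsuUser;
}

// Builds one <p> element (as an HTML string) for a single field.
function toParagraph(user: OsuUser, field: keyof OsuUser): string {
  return `<p>${user[field]}</p>`;
}
```

In the browser, the resulting string could be assigned to the innerHTML of the target div, and the styling moved into CSS instead of an inline style attribute.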