On the web page there are a few articles. I need to get links to all articles.
I use Selenium and PowerShell.
I do a search with:
FindElementByXPath("//*[contains(@class, 'without')]").GetAttribute("href")
but only get a link to the first article.
How to get links to all the articles?
All article links look like this:
<a class="without" href="http://articlelink.html"><h2>article</h2></a>
I don't know anything about PowerShell, but using Java with Selenium you can do it as in the code below.
I know it's not exactly the answer for your language, but the code below should give you a hint of how to approach it elsewhere.
List<WebElement> links = driver.findElements(By.className("without")); // get all elements whose class name is "without"
System.out.println(links.size()); // total number of links on the page
for (int i = 0; i < links.size(); i++) {
    System.out.println(links.get(i).getAttribute("href")); // print each link's href one by one
    links.get(i).click(); // click the link if you want to
    Thread.sleep(2500); // wait for 2.5 seconds
}
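One caveat with the loop above: clicking a link navigates away, which makes the remaining WebElements in links stale. A safer variant, sketched under the same assumptions as the code above (same driver and class name), collects the hrefs first and then visits them:
import java.util.ArrayList;
import java.util.List;

// Collect every href up front so no WebElement reference goes stale after navigation
List<String> hrefs = new ArrayList<>();
for (WebElement link : driver.findElements(By.className("without"))) {
    hrefs.add(link.getAttribute("href"));
}
for (String href : hrefs) {
    driver.get(href); // open each article directly by URL
}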
Hope my answer above helps you.
I'm trying to scrape a div class but everything I have tried has failed so far :(
I'm trying to scrape the element(s):
<a href="http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope"><div class="s_buttons_button s_buttons_buttonAlt s_buttons_buttonSlashBack">More info</div></a>
from the website: http://www.bellator.com/events
I tried accessing the list of elements by doing
Elements elements = document.select("div[class=s_container] > li");
but that didn't return anything.
Then I tried accessing just the parent with
Elements elements = document.select("div[class=s_container]");
and that returned two divs with class name "s_container", none of which was the one I needed :<
Then I tried accessing that one's parent with
Elements elements = document.select("div[class=ent_m152_bellator module
ent_m152_bellator_V1_1_0 ent_m152]");
And that didn't return anything.
I also tried
Elements elements = document.select("div[class=ent_m152_bellator]");
because I wasn't sure about the whitespace, but it didn't return anything either.
Then I tried accessing its parent by
Elements elements = document.select("div#t3_lc");
and that worked, but it returned an element containing
<div id="t3_lc">
<div class="triforce-module" id="t3_lc_promo1"></div>
</div>
which is kinda weird, because I can't see that it has that child when I inspect the website in Chrome :S
Anyone know what's going on? I feel kinda lost...
What you see in your web browser is not what Jsoup sees. Disable JavaScript and refresh the page to get what Jsoup gets, OR press CTRL+U ("Show source", not "Inspect"!) in your browser to see the original HTML document before JavaScript modifications. When you use your browser's debugger it shows the final document after modifications, so it's not suitable for your needs.
It seems like the whole "UPCOMING EVENTS" section is dynamically loaded by JavaScript.
Even more, this section is asynchronously loaded with AJAX. You can use your browser's debugger (Network tab) to see every possible request and response.
I found it, but unfortunately all the data you need is returned as JSON, so you're going to need another library to parse it.
That's not the end of the bad news; this case is even more complicated. You could make a direct request for the data:
http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b
but the URL seems random, and a few of these URLs (one per upcoming event?) are embedded inside JavaScript code in the HTML.
My approach would be to get the URLs of these feeds with something like:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> feedUrls = new ArrayList<>();
// select all the scripts; note that Jsoup exposes script contents via data(), not text()
Elements scripts = document.select("script");
Pattern feedPattern = Pattern.compile("http://www\\.bellator\\.com/feeds/[^\"'\\s]+");
for (Element script : scripts) {
    Matcher matcher = feedPattern.matcher(script.data());
    while (matcher.find()) {
        feedUrls.add(matcher.group()); // collect every feed URL found in this script
    }
}
for (String feedUrl : feedUrls) {
    // download each feed; execute().body() returns the raw JSON string unmodified,
    // whereas get() would wrap it in a parsed HTML document
    String json = Jsoup.connect(feedUrl).ignoreContentType(true).execute().body();
    // here use a JSON parsing library to get the data you need
}
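For that last step, here is a minimal sketch using the org.json library (any JSON library works). The feed's schema isn't documented here, so rather than guessing field names, the loop just prints the top-level keys so you can find where the event data lives:
import org.json.JSONObject;

// Hedged sketch: the feed's structure is unknown, so explore it first
JSONObject feed = new JSONObject(json);
for (String key : feed.keySet()) {
    System.out.println(key + " -> " + feed.get(key)); // inspect before extracting specific fields
}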
An ALTERNATIVE approach would be to stop using Jsoup, because of its limitations, and use Selenium WebDriver instead, as it supports dynamic page modifications by JavaScript; you'd get the HTML of the final result - exactly what you see in the web browser and the Inspector.
If anyone finds this in the future: I managed to solve it with Selenium. I don't know if it's a good/correct solution, but it seems to be working.
// point Selenium at the local ChromeDriver binary
System.setProperty("webdriver.chrome.driver", "C:\\Users\\PC\\Desktop\\Chromedriver\\chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get("http://www.bellator.com/events");
// grab the JavaScript-rendered HTML and hand it to Jsoup for selecting
String html = driver.getPageSource();
Document doc = Jsoup.parse(html);
Elements elements = doc.select("ul.s_layouts_lineListAlt > li > a");
for (Element element : elements) {
    System.out.println(element.attr("href"));
}
Output:
http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope
http://www.bellator.com/events/ylcu8d/bellator-215-mitrione-vs-kharitonov
http://www.bellator.com/events/yk2djw/bellator-216-mvp-vs-daley
http://www.bellator.com/events/e8rdqs/bellator-217-gallagher-vs-graham
http://www.bellator.com/events/281wxq/bellator-218-sanchez-vs-grimshaw
http://www.bellator.com/events/8lcbdi/bellator-219-koreshkov-vs-larkin
http://www.bellator.com/events/9rqguc/bellator-macdonald-vs-fitch
I am able to display a download link on a category page that downloads all the pages of that category.
In the link below, it is written:
In order to include this parser function link automatically to every category page, add it to the Mediawiki:Categoryarticlecount page.
Rather than adding the download link manually to all categories, I tried the above. That is, I added the download link to the Mediawiki:Categoryarticlecount page to automatically include the fullurl parser function link on every category page. But it didn't work.
Parser function link : [{{fullurl:{{FULLPAGENAME}}|action=pdfbook}} | Download]
How to achieve this?
Any help is appreciated.
You have two typos in there:
The system message is at MediaWiki:Category-article-count (note the camel-case in MediaWiki)
The external link syntax is [url text], not [url | text], so it should be [{{fullurl:{{FULLPAGENAME}}|action=pdfbook}} Download]
Other than that, your code looks fine.
I need to scrape some URLs from some retailer product pages, but the specific URLs I need to get aren't in the HTML of the page. The HTML looks like this for each of the items on which one would click to get to the page with the URL I need to grab:
<div id="name" class="hand bold" onclick="AVON.productcontrol.Go(45714);">ADVANCE TECHNIQUES Color Protection Conditioner Bonus Size</div>
I wrote the following to get URLs from the page, but since the actual URLs I need don’t seem to be stored in the page, it doesn’t get what I need:
import urllib
import lxml.html
from lxml.cssselect import CSSSelector

def getUrls(URL):
    """input: product page url
    output: list of urls to products
    """
    connection = urllib.urlopen(URL)  # fetch the page (Python 2 urllib)
    dom = lxml.html.fromstring(connection.read())  # parse the HTML into a tree
    selAnchor = CSSSelector('a')  # select every anchor tag
    foundElements = selAnchor(dom)
    urlList = [e.get('href') for e in foundElements]  # collect the href of each anchor
    return urlList
Is there a way to get the link that the function after ‘onclick’ (I guess AVON.productcontrol.Go(#);) takes you to? I don’t fully understand HTML, and while I’ve read a bit about onclick, I can’t figure out how the function after 'onclick' works.
In order to find the URL that you are taken to on click, you need to find the JavaScript source code of the 'Go' function and read and understand it. It's buried somewhere within a <script> tag or some JavaScript .js file that is referenced directly or indirectly by the HTML page. Happy digging!
Or: you automate the interaction with the web page with a tool like Selenium (http://docs.seleniumhq.org/) and just check where it takes you if you click.
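To sketch that second route (in Java, matching the other answers on this page; the listing URL below is a placeholder, and the id "name" is taken from the snippet in the question): click the item and read the URL the browser ends up on.
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

// Minimal sketch, not the retailer's real flow
WebDriver driver = new ChromeDriver();
driver.get("http://www.example.com/products"); // hypothetical product listing page
driver.findElement(By.id("name")).click(); // fires the onclick handler, e.g. AVON.productcontrol.Go(45714)
Thread.sleep(2000); // crude wait; prefer WebDriverWait in real code
System.out.println(driver.getCurrentUrl()); // the URL the onclick navigated to
driver.quit();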
I'm trying to build a Rainmeter widget that fetches the nearest town to the user and displays it onscreen. I'm currently trying to use the WebParser plugin and this website, but it doesn't seem to be working. The code I've adapted from the example on the Rainmeter website is below - any ideas?
[Rainmeter]
Author=Rainmeter staff
Update=1000
;[WEBSITE MEASURES]===============================
[MeasureWebsite]
Measure=Plugin
Plugin=WebParser
UpdateRate=1800
URL=http://locationdetection.mobi/
RegExp="(?siU)<span style="color:white;">town_city:</span> <b>"(.*)"</b>.*"
[MeasureTown]
Measure=Plugin
Plugin=WebParser
Url=[MeasureWebsite]
StringIndex=1
;[DISPLAY METERS]==================================
[TextStyle]
X=2
Y=17
FontFace=Segoe UI
FontSize=32
FontColor=#454442
StringStyle=Bold
Antialias=1
[MeterTown]
MeasureName=MeasureTown
Meter=String
MeterStyle=TextStyle
Y=2
I'm not sure if it's too late but... your URL doesn't seem to contain any town_city information... at least not in the source code.
Rainmeter reads the raw HTML code from the URL you give to it and processes it through the provided RegEx. Without proper information in the website you can never get it to work.
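A quick way to check what a regex-based parser actually receives is to fetch the raw HTML yourself. A sketch in Java with Jsoup (used elsewhere on this page; Rainmeter does the equivalent internally):
import org.jsoup.Jsoup;

// Fetch the page exactly as a non-JavaScript client sees it and check
// whether the text the RegExp expects is present at all
String raw = Jsoup.connect("http://locationdetection.mobi/").execute().body();
System.out.println(raw.contains("town_city")); // if false, no RegExp can ever match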
I have a feed from my Twitter profile at the top of my site, but I wondered if there is a way to filter out my @replies and only show my status updates?
Thanks
Maybe with Yahoo Pipes.
Tomalak has made a quick example for you.
If you're using the standard Twitter feed web code for Blogger and similar sites, this bit of JavaScript does the trick. It sits between the Twitter feed and the callback and strips replies out of the server response.
For a blog badge, the standard Twitter web code ends with two <script> tags. The first provides the function that displays your tweets. The second queries twitter for the tweets to display.
Add this script to your badge code before the Twitter query. It provides a new function called filterCallback which strips @replies from the Twitter response.
<script type="text/javascript">
function filterCallback( twitter_json ) {
  var result = [];
  for (var index in twitter_json) {
    if (twitter_json[index].in_reply_to_user_id == null) {
      result[result.length] = twitter_json[index];
    }
    if (result.length == 5) break; // Edit this to change the maximum tweets shown
  }
  twitterCallback2(result); // Pass tweets on to the original callback. Don't change it!
}
</script>
The Twitter query itself has a parameter which specifies what function to call when the response comes back. In Blogger's case, that function is called 'twitterCallback2' - you can search for it in the web code (look for callback=twitterCallback2). To use the new filter you need to replace the text twitterCallback2 with filterCallback. The filter is hard-coded to then call twitterCallback2 when it's done.
Note that this will reduce the number of displayed tweets if some of the responses from Twitter are replies, so you have to increase the count parameter in the call to allow for that. The new function then limits the number of displayed tweets to five - edit the code to change that.
Here's my blog post about it: Filter Replies out of Twitter Feed
If you want to use the new Twitter widgets, just add this piece of code within the features: setting of the widget's source code:
filters: {
negatives: /\B@\w{1,20}(\s+|$)/
},
I took this one from Dustin Diaz's website at http://www.dustindiaz.com. Dustin Diaz is the creator of the Twitter widget.
Change the setUser call to
setUser('name&exclude_replies=true');
This is kind of a hack, but it does the trick.
Depends on what you're using to display the entries. If you're using Twitter's widget, then probably not. If you're using some other programmatic way of displaying the items, you'd need to provide more details about what you're doing (language, sample code, etc) and we can probably help with filtering.
You'll probably want to use a regular expression. Something along the lines of:
[a-zA-Z0-9][a-zA-Z0-9]*: @[a-zA-Z0-9][a-zA-Z0-9]*.*
It depends on how you are formatting your Twitter feed on your page. This regex assumes that it's formatted something like:
username: @username msg txt
If it matches, don't display it. If it doesn't match, then display it. :) If you've got tags in there along with the text, adjust the regex appropriately.
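As a minimal sketch of that check (in Java, matching the rest of this page; the sample string is made up):
import java.util.regex.Pattern;

// Matches lines in the assumed "username: @username msg" format, i.e. replies
Pattern reply = Pattern.compile("[a-zA-Z0-9]+: @[a-zA-Z0-9]+.*");
String tweet = "alice: @bob thanks for the link!"; // hypothetical feed line
if (!reply.matcher(tweet).matches()) {
    System.out.println(tweet); // display only tweets that are not replies
}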