I'm trying to scrape or obtain the text of Disqus comments from an online local newspaper using RSelenium in Chrome, but am finding the going a little tough for my capabilities. I have searched in many places but have not found the right information, or (most probably) I am using the wrong search terms.
So far I have managed to get the "normal" HTML from the pages, but cannot pinpoint the right class, CSS selector or id to get the Disqus comments. I have also tried SelectorGadget, but it only points to #dsq-app2, which selects the whole Disqus area at once and does not let me select smaller parts of it. I tried the same with RSelenium using elems <- mybrowser$findElement(using = "id", "dsq-app2"), with an "environment" being stored in elems. Then I tried to find child elements within elems, but came up blank.
Viewing the page via the developer tools I can see that the interesting stuff is inside an iframe with the id dsq-app2, and I have managed to extract all of its source through elems$getPageSource() after switching to the frame with elems$switchToFrame("dsq-app2"). This outputs all the HTML as one big "dirty" chunk, and short of searching for the required content held in <p> tags and other elements of interest, such as posters' usernames in data-role="username", I don't seem to find the right way forward.
I have also tried the advice given here, but the Disqus setup is a little different. One of the pages I'm trying is this, with the bulk of the comments area inside a section called conversation, plus a ton of other ids such as posts, and the unordered list with id=post-list that ultimately carries the comments I need to scrape.
Any ideas or tips are most welcome and received with thanks.
After a lot of testing and experimenting I managed to get it working. I don't know if it's the cleanest or prettiest solution, but it works; I hope others will find it useful. Basically, what I did was find the URL that points to the comments only. This is found within the dsq-app2 iframe, in an attribute called src. At first I was also switching to the iframe, but I found that this works without doing so.
remDr$navigate("toTheRequiredPage")
elemsource <- remDr$findElement(using = "id", value = "dsq-app2")
src <- elemsource$getElementAttribute("src") # find the src attribute within the iframe
remDr$navigate(src[[1]]) # navigate to the src url
# find the posts from the new page
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")
length(elem.msgs)
msgtext <- elem.msgs[[1]]$getElementText() # find first post's text
msgtext # print message
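If you need every comment rather than just the first, a small loop over elem.msgs should work. A minimal sketch, using the objects defined above (getElementText() returns a one-element list, hence the [[1]]):

# Collect the text of all scraped comments into one character vector
all.msgs <- sapply(elem.msgs, function(m) m$getElementText()[[1]])
length(all.msgs)
head(all.msgs)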
Update: I found out that if I use remDr$switchToFrame("dsq-app2") I do not need to navigate to the src URL as explained above. So there are actually two ways of scraping:
Use switchToFrame("nameOfFrame") or
Use my prior solution of using the src URL from the iframe
Hope this makes it clearer.
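For completeness, a minimal sketch of the first route, reusing the ids from the solution above (untested beyond what is described in this answer):

remDr$navigate("toTheRequiredPage")
remDr$switchToFrame("dsq-app2")   # switch into the Disqus iframe instead of navigating to its src
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")
elem.msgs[[1]]$getElementText()   # first post's text, as before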
I want to extract the pages linked from the infoboxes and templates of Wikipedia pages.
E.g., from this page:
https://en.wikipedia.org/wiki/DNA
I want to extract all of the links in the infobox, like: "Genetics", "Introduction to Genetics" etc.
I want to do this using the SQL dumps, ideally without parsing the XML of whole pages, and I don't want to do it with the APIs.
I could not find a way.
While pagelinks does include the links from infoboxes, I cannot find a way to tell them apart from the rest.
I thought templatelinks might have that information, but it does not: I could not find the page IDs of the corresponding links in infoboxes.
Where is this information stored?
Or which kind of tables should I look at?
I consulted previous questions:
where can I find the infobox templates used in wiki?
and the MediaWiki reference:
https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary
but could not find a solution.
That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar
I don't think there's a way of doing it other than parsing the content of the template to extract the links or using the API: e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0
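If the API route is acceptable after all, that query is easy to issue from R. A minimal sketch with jsonlite (the format=json parameter and the result path are assumptions based on the standard MediaWiki API; paging via plcontinue is omitted):

library(jsonlite)

res <- fromJSON(paste0(
  "https://en.wikipedia.org/w/api.php?action=query&prop=links",
  "&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0&format=json"
))
# Each page entry carries a data frame of its main-namespace links
links <- res$query$pages[[1]]$links
head(links$title)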
Something like this should also work. Note that pagelinks stores the link target in pl_namespace/pl_title and only a numeric page ID for the source (pl_from), so to list the links made from the template you have to join against the page table. The query below is my corrected attempt; the version I first tried, filtering on pl_title = 'Genetics_sidebar', returned no results because it searches for links to the template rather than from it:

SELECT pl_namespace, pl_title
FROM pagelinks
JOIN page ON pl_from = page_id
WHERE page_namespace = 10            -- the Template: namespace
  AND page_title = 'Genetics_sidebar'
  AND pl_namespace = 0;              -- targets in the main (article) namespace
https://quarry.wmcloud.org/query/71442
I am working on a project where I want to scrape a page like this in order to get the city of origin. I tried to use the CSS selector ".type-12~ .type-12+ .type-12", but I do not get the text into R.
Link:
https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description
I use rvest and the read_html function.
However, it seems that the source has some scripts in it. Is there a way to scrape the website after the scripts have returned their results (as you see it with a browser)?
PS: I looked at similar questions but did not find the answer.
Code:
library(rvest)

main.names <- read_html("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description")
names1 <- main.names %>%       # feed `main.names` to the next step
  html_nodes("div.mb0-md") %>% # get the matching CSS nodes
  html_text()                  # extract the text
You should not do it. They provide an API, which you can find here: https://status.kickstarter.com/api
Using APIs or Ajax/JSON calls is usually better, since:
The server isn't overloaded: a scraper that visits every link it can find causes unnecessary traffic, which is bad for the speed of your program and bad for the servers of the site you are scraping.
You don't have to worry that a changed class name or id will break your code.
The second point in particular should interest you, since it can take hours to find which class is no longer returning a value.
But to answer your question:
With the right scraper you can find everything you want. What tools are you using? There are ways to get data before the site is loaded or after. You can execute the JS on the site separately and find hidden content, or find things like display: none CSS classes...
It really depends on what you are using and how you use it; a sketch of one approach follows below.
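For instance, since the question is R-based, one way to scrape the page after the scripts have run is to render it with RSelenium (as in the first question above) and then hand the rendered DOM to rvest. A rough sketch, assuming a working local Selenium/chromedriver setup; the selector is the one from the question:

library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "chrome")    # start a Selenium session (assumed setup)
remDr <- rD$client
remDr$navigate("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description")
Sys.sleep(5)                          # crude wait for the scripts to finish rendering

# Parse the fully rendered DOM with rvest instead of the raw source
rendered <- read_html(remDr$getPageSource()[[1]])
rendered %>%
  html_nodes(".type-12~ .type-12+ .type-12") %>%
  html_text()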
I have a link that contains a table. The first thing I tried was to check whether there is any button to click to download the data; unfortunately there isn't. Then I tried to use the XML package in R to fetch the data between different nodes and build up a data frame myself.
In order to do this I need to know which node (or HTML tag) I would like to extract. So I right-clicked in the web browser and found the tag that contains the table I want.
The content of the table starts at <fieldset id="result">. We can also see from the browser that the first row of the table is <li class="vesselResultEntry removeBackground">.
Then, when I tried to use R to download this HTML, I found that all the <li> tags relating to the table are gone, replaced by <li class="toRemove"/>. Here is my R code, by the way:
library(XML)
url <- "http://www.fao.org/figis/vrmf/finder/search/#stats"
webpage <- readLines(url)
htmlpage <- htmlParse(webpage, asText = TRUE)
data <- xpathSApply(htmlpage, "//ul[@id='searchResultsContainer']")
data
# <ul id="searchResultsContainer" class="clean resultsContainer"><li class="toRemove"></li></ul>
What I'm trying to do in the code is simply to see whether I can fetch the content of a specific tag. Clearly the row I want to fetch is not in the object (webpage) I saved.
So my questions are:
Is there a way to download the table I want by any means (ideally in R)?
Is there some kind of protection on this website that prevents me from downloading the whole HTML as text and fetching data from it?
Any suggestions are much appreciated.
The page you're trying to fetch is assembled dynamically on the browser side on load. The content you get by directly fetching the url does not contain the data you see when you view the DOM. That data is loaded later from a separate URL.
I took a look and the URL in question is:
http://www.fao.org/figis/vrmf/finder/services/public/vessels/search?c=true&gd=true&nof=false&not=false&nol=false&ps=30&o=0&user=NOT_SET
I'm not sure what most of the query string means, but it's clear that ps is "page size" and o is "offset". Page size seems to cap at 200, above which it is forced back to 30. The URL returns JSON, so you'll need some way to parse that. The data embedded in the responses says there are 231047 entries, so you'll have to make multiple requests to get it all.
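As a rough sketch, paging through that endpoint from R with jsonlite might look like this (the exact shape of the JSON is an assumption; inspect one response first to find the record and total-count fields):

library(jsonlite)

base <- "http://www.fao.org/figis/vrmf/finder/services/public/vessels/search"
qs   <- "?c=true&gd=true&nof=false&not=false&nol=false&user=NOT_SET"
ps   <- 200   # page size; values above 200 seem to be forced back to 30

fetch_page <- function(offset) {
  fromJSON(paste0(base, qs, "&ps=", ps, "&o=", offset))
}

first <- fetch_page(0)
str(first, max.level = 2)   # locate the records and the 231047 total here,
                            # then loop over offsets 0, 200, 400, ... and combine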
Data providers usually do not appreciate people scraping their data like that. You might want to look around for a downloadable version.
I've seen various SO threads about Facebook Open Graph image tags, such as this one: Facebook multiple og:image tags - Which is Default?
These threads are 2.5 years old, so I'm wondering if the rules have been updated. Also, the accepted answer that the highest-resolution image is the one displayed seems imperfect. What if the images aren't the same? For example, how do you have one image for the homepage and a different one on AJAX-loaded pages?
As these rules are used by FB, Reddit, and many high-traffic sites, this information is obviously very valuable. Thanks!
Since reddit is open-source, you can look and see what its behavior is.
The place you want to look is _find_thumbnail_image(). Right now, this is the code that pertains to Open Graph:
# Allow the content author to specify the thumbnail using the Open
# Graph protocol: http://ogp.me/
og_image = (soup.find('meta', property='og:image') or
            soup.find('meta', attrs={'name': 'og:image'}))
if og_image and og_image['content']:
    return og_image['content'], None
og_image = (soup.find('meta', property='og:image:url') or
            soup.find('meta', attrs={'name': 'og:image:url'}))
if og_image and og_image['content']:
    return og_image['content'], None
So, it'll use whatever Beautiful Soup's find() method returns, which should be the first matching tag.
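If you want to check from R which tag a given page would hand to such a scraper, a quick rvest sketch (the URL is a placeholder):

library(rvest)

page <- read_html("https://example.com/some-article")   # placeholder URL
imgs <- page %>%
  html_nodes("meta[property='og:image']") %>%
  html_attr("content")
imgs[1]   # the first matching og:image, mirroring reddit's first-match behavior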
Hi, so I need to retrieve the URL of the first article for a term I search on nytimes.com.
So if I search for Apple, this link would return the results:
http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse
And you just replace Apple with the term you are searching for.
If you click on that link you will see that NYTimes asks you if you mean Apple Inc.
I want to get the URL of that link, and go to it.
Then you will just get a lot of information on Apple Inc.
If you scroll down you will see the articles related to Apple.
So what I ultimately want is the URL of the first article on this page.
So I really do not know how to go about this. Do I use Java, or what do I use? Any help would be greatly appreciated; I would put a bounty on this later, but I need the answer ASAP.
Thanks
EDIT: Can we do this in Java?
You can use Python with the standard urllib2 module to fetch the pages and the great HTML parser BeautifulSoup to obtain the information you need from them.
From the documentation of BeautifulSoup, here's sample code that fetches a web page and extracts some info from it:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
This is a nice and detailed article on the topic.
You certainly can do it in Java. Look at the HttpURLConnection class. Basically, you give it a URL, call the connect method, and you get back an input stream with the contents of the page, i.e. HTML text. You can then process that and parse out whatever information you want.
You're facing two challenges in the project you are describing. The first, and probably the lesser one, is figuring out the mechanics of how to connect to a web page and get hold of the text within your program. The second and probably bigger challenge will be figuring out exactly how to extract the information you want from that text.

I'm not clear on the details of your requirements, but you're going to have to sort through a ton of text to find what you're looking for. Without actually looking at the NY Times site at the moment, I'm sure it has all sorts of decorations like pretty pictures, the company logo, and headlines, and then there are menus and advertisements and all sorts of other stuff. I sincerely doubt that the NY Times or almost any other commercial web site is going to return a search page that includes nothing but a link to the article you are interested in. Somehow your program will have to figure out that the first link is to the "subscribe online" page, the second is to an advertisement, the third is to customer service, the fourth and fifth are additional advertisements, the sixth is to the home page, etc., until you finally get to the one you're actually interested in. How will you identify the interesting link? There are probably headings or formatting that make it recognizable to a human being, but a human uses a lot of intuition to screen out the clutter, and that intuition can be difficult to reproduce in a program.
Good luck!
You can do this in C# using the HTML Agility Pack, or using LINQ to XML if the site is valid XHTML. EDIT: It isn't valid XHTML; I checked.
The following (tested) code will get the URL of the first search result:
var doc = new HtmlWeb().Load(@"http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse");
var url = HtmlEntity.DeEntitize(doc.DocumentNode.Descendants("ul")
    .First(ul => ul.Attributes["class"] != null
              && ul.Attributes["class"].Value == "results")
    .Descendants("a")
    .First()
    .Attributes["href"].Value);
Note that if their website changes, this code might stop working.