I've been playing around with scraping webpages using BeautifulSoup for a few weeks now. An issue I recently ran into, and hadn't seen before, is that the content displayed on a webpage differs from both the page's source code and the response returned for the URL request.
For example, let's look at Yelp. This (http://www.yelp.com/search?find_desc=&find_loc=Pittsburgh%2C+PA%2C+USA&ns=1) will bring up all 63k businesses in the Pittsburgh, PA area. If we look at the page's source, we see that it matches the content (if you search for the word "showing", it finds the code below).
<span class="pagination-results-window">
Showing 1-10 of 63936
</span>
Now, let's only look at restaurants in the Pittsburgh, PA area. This reduces the number of returned results from 63k to 5k. However, if we look at the page's source, the same code shown above is seen. Moreover, the first returned result in the page source matches the 63k page, not the 5k page. At first, I thought this might be due to Mozilla caching webpage content, but I quickly nixed this idea by scraping the link for the 5k restaurants (http://www.yelp.com/search?find_desc=&find_loc=Pittsburgh%2C+PA%2C+USA&ns=1#cflt=restaurants). The scraper collected the HTML that generates the page with 63k businesses, not the 5k restaurants that I was expecting.
My question is: what is causing this? Is this done intentionally by Yelp, or is it caused by something external? I've tried looking this up on my own, but I'm unable to find anything that explains this using the verbiage in this question's title.
Let me know if you need more details; I'm happy to provide the few lines of code that I left out.
Thanks!
Yelp, like many responsive sites, uses AJAX to fetch more data and/or jQuery to perform filtering. A scraper that simply requests the URL only gets the base HTML, before any jQuery or AJAX updates are performed.
Both of these URLs are most likely the same to server-side code:
search?find_desc=&find_loc=Pittsburgh%2C+PA%2C+USA&ns=1
search?find_desc=&find_loc=Pittsburgh%2C+PA%2C+USA&ns=1#cflt=restaurants
That is why you see the same scraped results in both cases: the fragment (everything from the # onward) is never sent to the server. Instead, #cflt=restaurants is read by client-side JavaScript, which kicks off a script to filter the results.
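To see why the server cannot tell the two URLs apart, here is a minimal sketch using Python's standard urllib.parse. The helper name request_target is made up for illustration; it only mimics what an HTTP client puts on the request line and sends nothing over the network.

```python
from urllib.parse import urlsplit

BASE = "http://www.yelp.com/search?find_desc=&find_loc=Pittsburgh%2C+PA%2C+USA&ns=1"
FILTERED = BASE + "#cflt=restaurants"


def request_target(url):
    """Return the path and query that an HTTP client sends to the server.

    The fragment is deliberately left out: it stays in the client.
    (request_target is a hypothetical helper, not a library function.)
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


print(urlsplit(FILTERED).fragment)                        # cflt=restaurants
print(request_target(BASE) == request_target(FILTERED))   # True
```

Since both URLs produce the same request, getting the filtered results means either running the page's JavaScript or finding the request that the script itself makes.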
I'm creating a VCL application with Delphi 10.3 and want to support some web functionality: the user enters the ISBN of a book into a TEdit component, and the application sends this value to the search field on this website: https://isbnsearch.org. The website then looks up the ISBN and displays the author of the book. I want to somehow access the information (i.e. the author) presented by the search result and use it in my application.
This is my GUI, for a better idea of what I want to accomplish:
What code can I use for this? Any other feasible suggestions or approaches are acceptable.
When performing a search on that website, it simply loads a page with a specific URL query string...
https://isbnsearch.org/search?s=suess
The above example is when I search for "suess", so you can easily concatenate a search URL.
You can use any HTTP component, such as TIdHTTP, to load this search page, then use an HTML parser to scrape the page and read what you need. Much, much easier than trying to read through the TWebBrowser.
In the end, you won't actually display the HTML (I mean you can if you want to), but the idea is to read the data and display it in your own format.
On that specific page, start by locating the ul element with id searchresults. Each li element inside it contains an individual result. Unfortunately, this website uses pagination and only shows 10 results per page. To get the remaining results, call this page again with an additional parameter: &p=2 for the 2nd page, &p=3 for the 3rd page, and so on.
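The same steps can be sketched in Python with only the standard library (the approach carries over to TIdHTTP plus any HTML parser in Delphi). The s and p parameters and the searchresults id come from the description above. The function and class names, and the sample markup, are made up for illustration; the real page's markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def search_url(term, page=1):
    """Build the search URL; pages after the first add the p parameter."""
    params = {"s": term}
    if page > 1:
        params["p"] = page
    return "https://isbnsearch.org/search?" + urlencode(params)


class ResultCollector(HTMLParser):
    """Collect the text of each <li> inside <ul id="searchresults">."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_list = False
        self._current = None  # text pieces of the <li> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and dict(attrs).get("id") == "searchresults":
            self._in_list = True
        elif tag == "li" and self._in_list:
            self._current = []

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.results.append(" ".join(self._current))
            self._current = None
        elif tag == "ul" and self._in_list:
            self._in_list = False

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current.append(data.strip())


# Hypothetical markup, invented for this sketch.
SAMPLE = """
<ul id="searchresults">
  <li><h2>First Book</h2><p>Author: A. Writer</p></li>
  <li><h2>Second Book</h2><p>Author: B. Writer</p></li>
</ul>
"""

collector = ResultCollector()
collector.feed(SAMPLE)
print(search_url("suess", page=2))  # https://isbnsearch.org/search?s=suess&p=2
print(collector.results)
```

In a real program, the text fed to the collector would be the body of the HTTP response for each page, and the loop over pages would stop once a page yields no results.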
On the other hand, that is the worst way to acquire such information. What you should be doing is using a proper API which gives you machine-friendly data. The service you are referencing doesn't appear to have an option, but here's an example of one which does:
https://openlibrary.org/dev/docs/api/books - this also appears to provide you MUCH more information than the one you're using.
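As a rough sketch of the API route, the snippet below builds a Books API request URL and reads author names from a JSON response. The endpoint path, the bibkeys, format and jscmd parameters, and the response shape are assumptions based on my reading of the linked documentation, so check them there before relying on them. The sample response is hand-written, not real API output.

```python
import json
from urllib.parse import urlencode


def books_api_url(isbn):
    """Build a Books API URL for one ISBN (parameter names are assumptions)."""
    query = urlencode({"bibkeys": f"ISBN:{isbn}", "format": "json", "jscmd": "data"})
    return "https://openlibrary.org/api/books?" + query


def author_names(response_text, isbn):
    """Return the author names for the ISBN, or [] if the book is missing."""
    record = json.loads(response_text).get(f"ISBN:{isbn}", {})
    return [author["name"] for author in record.get("authors", [])]


# Hand-written sample in the assumed response shape, with a placeholder ISBN.
SAMPLE = (
    '{"ISBN:0000000000": {"title": "Example Title", '
    '"authors": [{"name": "Example Author"}]}}'
)

print(books_api_url("0000000000"))
print(author_names(SAMPLE, "0000000000"))  # ['Example Author']
```

Compared with scraping, there is no markup to walk through: the author is a named field, and a missing book simply yields an empty list.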
I am working on a project where I want to scrape a page like this, in order to get the city of origin. I tried to use the css selector: ".type-12~ .type-12+ .type-12" However I do not get the text into R.
Link:
https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description
I use rvest and the read_html function.
However, it seems that the source has some scripts in it. Is there a way to scrape the website after the scripts have returned their results (as you see it with a browser)?
PS: I looked at similar questions but did not find the answer.
Code:
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
# download and parse the project page
main.names <- read_html("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description")
names1 <- main.names %>%        # feed `main.names` to the next step
  html_nodes("div.mb0-md") %>%  # get the nodes matching the CSS selector
  html_text()                   # extract the text
You should not do it. They provide an API, which you can find here: https://status.kickstarter.com/api
Using APIs or Ajax/JSON calls is usually better because:
The server isn't overloaded by a scraper that visits every link it can find and causes unnecessary traffic. That traffic is bad for the speed of your program and bad for the servers of the site you are scraping.
You don't have to worry that a changed class name or id will break your code.
The second point especially should interest you, since it can take hours to find out which class is no longer returning a value.
But to answer your question:
With the right scraper you can find everything you want. What tools are you using? It is possible to get the data before or after the site has loaded. You can execute the JS on the site separately to find hidden content, or find things like display:none CSS classes...
It really depends on what you are using and how you use it.
When extracting data you can use CSS selectors or XPath. But is there a similar, reliable method of doing this on the page source?
www.amazon.com/Best-Sellers-Electronics-Televisions/zgbs/electronics/172659
You could get the page source and then parse it using regex, but that would probably not be reliable if, for instance, the TV did not load on the page. I have looked up various solutions, but I have yet to find one that mentions getting every TV at the start of each line (1, 4, 7, etc. in the source) or using a reliable method, e.g. CSS/XPath, on the source of a page.
What is the gold standard for a reliable method of doing what I am after?
To get the page source you can use cURL if the page is rendered entirely on the server side (most pages won't be), or headless Chrome to get the actual DOM that will render in the browser (https://developers.google.com/web/updates/2017/04/headless-chrome).
For scraping the content, I've used cheerio (https://github.com/cheeriojs/cheerio) which will allow you to read in HTML to an object and then scrape your data off that using jQuery expressions. (Headless chrome allows you to execute JS on the pages you visit, so you don't necessarily need cheerio).
In your specific example you could get the TV on each line by combining the right class selectors to get the divs containing TVs, and using an attribute selector with 'margin-left=0px', which would get the first item on each line. That is obviously very much bound to the structure of the page and will likely be broken by the smallest of changes in the page source. (And not really any different from using XPath. Still better than regex, though.)
As for certain elements loading or not loading on the page (if that was what you meant by the TV not being there), there are no golden solutions that I know of, except allowing sufficient time for the page to load and making sure your scraper fails gracefully.
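One way to picture "allow sufficient time and fail gracefully" is a small polling helper, sketched here in Python rather than with headless Chrome or cheerio. The names wait_for and find_tvs are hypothetical, and find_tvs is only a stand-in for whatever call looks up the elements in the rendered page.

```python
import time


def wait_for(fetch, timeout=10.0, interval=0.5):
    """Call fetch until it returns a non-empty result or timeout seconds pass.

    Returns the result, or None if nothing showed up in time, so the caller
    can skip the item instead of crashing.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = fetch()
        except Exception:  # a real scraper would catch narrower exceptions
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in lookup: finds nothing on the first two attempts, as if the page
# were still loading, then returns two items.
attempts = {"count": 0}


def find_tvs():
    attempts["count"] += 1
    return ["TV 1", "TV 2"] if attempts["count"] >= 3 else []


print(wait_for(find_tvs, timeout=2.0, interval=0.01))     # ['TV 1', 'TV 2']
print(wait_for(lambda: [], timeout=0.05, interval=0.01))  # None
```

Returning None instead of raising keeps one missing element from aborting a whole scraping run; the caller decides whether to retry later or log the gap.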
Short version: How do I know how to phrase additional data (like specific options on the page that display different HTML files but belong to the same URL) when getting a URL with urllib?
Long version:
I am having trouble figuring out how to handle properties of a URL request that are not determined by the link URL but probably by other information that your browser usually sends.
To be more precise:
This page contains a table that I want to read with Python, but the length of the table depends on the number of items per page you choose in the bottom left (i.e. the number of items in the code I get from urllib.request.urlopen is the default of 50 or so, not the complete table). Clicking the buttons for e.g. 400 items per page doesn't change the URL, so I expect that some information is sent somewhere else. I understand that urllib can send additional data besides the URL, but it is unclear to me how to figure out how I should phrase "give me the whole table" (or "give me 400 items per page") in that data.
Studying the .html file I get from saving the webpage in my browser didn't give me any hints, and I lack the vocabulary to search for answers on the web (that is, googling "urllib request parameter" is too vague).
Hence I'd be completely satisfied if someone would point me to a duplicate of this question.
Thanks in advance :)
For everyone else finding this question, I'll elaborate on the answer @deceze gave in the comments:
Open the webpage you want to read in your browser
Open your browser's developer tools (in Chromium this is [Ctrl+Shift+I] or right-click > Inspect)
Go to the "Network" Tab (at least in chromium)
Do whatever you want your program to do and the empty network panel list will fill with a lot of data
Find your request in the list of events (one of the very first ones is right, I would guess), click it and select "Headers"
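Once the request is found, what the "Headers" view shows can be replayed from Python. The sketch below assumes the page size is sent as form data in a POST request. It could just as well be a query-string parameter or a request to a different URL, so copy whatever the network panel actually shows. The URL, the items_per_page field and the header are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: replace them with what the "Headers" view shows.
URL = "https://example.com/table"
FORM_DATA = {"items_per_page": "400"}
HEADERS = {"X-Requested-With": "XMLHttpRequest"}


def build_request(url, form_data, headers):
    """Build a POST request carrying the same form data the browser sent."""
    body = urlencode(form_data).encode("utf-8")
    return Request(url, data=body, headers=headers)


request = build_request(URL, FORM_DATA, HEADERS)
print(request.get_method())  # POST
print(request.data)          # b'items_per_page=400'

# To actually send it (needs network access and real values):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     html = response.read().decode("utf-8")
```

Passing data to Request is what turns the call into a POST; without it, urllib sends a GET and the extra fields would have to go into the URL's query string instead.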
Hi, I need to retrieve the URL of the first article for a term I search for on nytimes.com.
So if I search for Apple, this link would return the result:
http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse
And you just replace Apple with the term you are searching for.
If you click on that link, you will see that the NYTimes asks you whether you mean Apple Inc.
I want to get the url for this link, and go to it.
Then you will just get a lot of information on Apple Inc.
If you scroll down you will see the articles related to Apple.
So what I ultimately want is the URL of the first article on this page.
So I really do not know how to go about this. Do I use Java, or what do I use? Any help would be greatly appreciated and I would put a bounty on this later, but I need the answer ASAP.
Thanks
EDIT: Can we do this in Java?
You can use Python with the standard urllib module to fetch the pages and the great HTML parser BeautifulSoup to obtain the information you need from the pages.
Adapted from the BeautifulSoup documentation, here's sample code that fetches a web page and extracts some info from it:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page, "html.parser")
# each incident sits in a table cell whose width attribute is "90%"
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print(where.strip())
    print(what.strip())
    print()
This is a nice and detailed article on the topic.
You certainly can do it in Java. Look at the HttpURLConnection class. Basically, you give it a URL, call the connect function, and you get back an input stream with the contents of the page, i.e. HTML text. You can then process that and parse out whatever information you want.
You're facing two challenges in the project you are describing. The first, and probably the lesser challenge, is figuring out the mechanics of how to connect to a web page and get hold of the text within your program. The second and probably bigger challenge will be to figure out exactly how to extract the information you want from that text. I'm not clear on the details of your requirements, but you're going to have to sort through a ton of text to find what you're looking for. Without actually looking at the NY Times site at the moment, I'm sure it has all sorts of decorations like pretty pictures and the company logo and headlines and so on, and then there are going to be menus and advertisements and all sorts of stuff. I sincerely doubt that the NY Times or almost any other commercial web site is going to return a search page that includes nothing but a link to the article you are interested in. Somehow your program will have to figure out that the first link is to the "subscribe on line" page, the second is to an advertisement, the third is to customer service, the fourth and fifth are additional advertisements, the sixth is to the home page, etc., until you finally get to the one you're actually interested in. How will you identify the interesting link? There are probably headings or formatting that make it recognizable to a human being, but you use a lot of intuition to screen out the clutter, and that can be difficult to reproduce in a program.
Good luck!
You can do this in C# using the HTML Agility Pack, or using LINQ to XML if the site is valid XHTML. EDIT: It isn't valid XHTML; I checked.
The following (tested) code will get the URL of the first search result:
using System.Linq;
using HtmlAgilityPack;

// Load the search page and take the first link inside <ul class="results">
var doc = new HtmlWeb().Load(@"http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse");
var url = HtmlEntity.DeEntitize(doc.DocumentNode.Descendants("ul")
    .First(ul => ul.Attributes["class"] != null
              && ul.Attributes["class"].Value == "results")
    .Descendants("a")
    .First()
    .Attributes["href"].Value);
Note that if their website changes, this code might stop working.