Retrieving URLs from functions within HTML (Python)

I need to scrape some URLs from some retailer product pages, but the specific URLs I need aren't in the HTML of the page itself. For each item you would click to reach the page whose URL I need to grab, the HTML looks like this:
<div id="name" class="hand bold" onclick="AVON.productcontrol.Go(45714);">ADVANCE TECHNIQUES Color Protection Conditioner Bonus Size</div>
I wrote the following to get URLs from the page, but since the actual URLs I need don’t seem to be stored in the page, it doesn’t get what I need:
import urllib
import lxml.html
from lxml.cssselect import CSSSelector

def getUrls(URL):
    """input: product page URL
    output: list of URLs to products
    """
    connection = urllib.urlopen(URL)
    dom = lxml.html.fromstring(connection.read())
    selAnchor = CSSSelector('a')  # select every anchor element
    foundElements = selAnchor(dom)
    urlList = [e.get('href') for e in foundElements]
    return urlList
Is there a way to get the link that the function in the 'onclick' attribute (AVON.productcontrol.Go(#);, I guess) takes you to? I don't fully understand HTML, and while I've read a bit about onclick, I can't figure out how the function behind it works.

In order to find the URL that you are taken to on click, you need to find the JavaScript source code of the 'Go' function and read and understand it. It's buried somewhere within a <script> tag or in a JavaScript .js file that is referenced directly or indirectly by the HTML page. Happy digging!
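If reading the Go function reveals that the product URL is built from that numeric ID, you can pull the IDs straight out of the onclick attributes. Here is a minimal sketch, using the same urllib/lxml stack as the question; the AVON.productcontrol.Go pattern comes from the snippet above, but the final URL format is exactly what you still need to dig out of the site's JavaScript:
import re
import urllib
import lxml.html

def getProductIds(URL):
    """Collect the numeric arguments of AVON.productcontrol.Go(...) from
    every onclick attribute on the page."""
    dom = lxml.html.fromstring(urllib.urlopen(URL).read())
    ids = []
    for el in dom.xpath('//*[@onclick]'):
        match = re.search(r'AVON\.productcontrol\.Go\((\d+)\)', el.get('onclick'))
        if match:
            ids.append(match.group(1))
    return ids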
Or: you automate the interaction with the web page with a tool like Selenium (http://docs.seleniumhq.org/) and just check where it takes you if you click.
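For the Selenium route, a minimal sketch could look like the following. The CSS selector is guessed from the snippet in the question, and it assumes the onclick handler performs a normal navigation, so treat it as a starting point:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get('http://example.com/product-listing')  # placeholder for the retailer page

# Assumption: the clickable items carry the classes shown in the question's snippet.
item = driver.find_element(By.CSS_SELECTOR, 'div.hand.bold')
item.click()  # runs AVON.productcontrol.Go(...) for you

# If the click triggers a normal page load, this is the URL you were after.
# You may need an explicit wait (WebDriverWait) before reading it.
print(driver.current_url)
driver.quit()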

Related

How can I differentiate between two url requests from different HTML pages but with the same namespace in views.py?

I am creating a simple eBay-like e-commerce website to get introduced to Django. For removing an item from the watchlist, I placed the same link in two different HTML files; that is, I can remove the item either from the watchlist.html page or from the item's page, which is saved as listing.html. The URL for both pages looks like this:
Remove from watchlist
Now, in my views.py, I want to render different pages on the basis of the request. For example, if someone clicked Remove from watchlist on listing.html, the link should redirect back to listing.html, and the same goes for watchlist.html.
I tried using request.resolver_match.view_name, but this gave me 'removeFromWatchlist' for both, since the URL name is the same for both requests.
Is there any way I can render two different HTML pages based on the origin of the URL request?
Also, this is my second question here so apologies for incorrect or bad formatting.
You could check the HTTP_REFERER in the request.META attribute of the view to get the URL that referred the request, like so:
from django.shortcuts import redirect

def myview(request):
    ...
    return redirect(request.META.get("HTTP_REFERER"))  # or however you prefer redirecting
https://docs.djangoproject.com/en/3.1/ref/request-response/#django.http.HttpRequest.META
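If you need to do more than bounce straight back, for example render watchlist.html or listing.html explicitly depending on where the click came from, a rough sketch along these lines could work. The view name, template names and the 'watchlist' substring check are taken from or assumed based on the question, and HTTP_REFERER can be missing or spoofed, so don't rely on it for anything security-relevant:
from django.shortcuts import render

def removeFromWatchlist(request, item_id):
    # ... remove the item from the user's watchlist here ...
    referer = request.META.get("HTTP_REFERER", "")
    if "watchlist" in referer:
        # Click came from the watchlist page (assumed template name)
        return render(request, "watchlist.html")
    # Otherwise assume it came from the item's own page (template context omitted)
    return render(request, "listing.html")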

Scraping prices with BeautifulSoup4 in Python3

I am new to scraping with Python and BeautifulSoup4. Also, I have no knowledge of HTML. To practice, I am trying to use it on the Carrefour website to extract the price and price per kilogram of the product that I search for by EAN code.
My code:
import requests
from bs4 import BeautifulSoup

barcodes = ['5449000000996']
for barcode in barcodes:
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class': 'ebx-result-price__value'})
    print(searchingprice)
    searchingpricerperkg = bs.find_all('span', {'class': 'ebx-result__quantity ebx-result-quantity'})
    print(searchingpricerperkg)
But I do not get any result at all.
Here is a screenshot of the HTML code:
What am I doing wrong? I tried this with another website and it seemed to work.
The problem here is that you're scraping a page with JavaScript-generated content. Basically, the page that you're grabbing with requests doesn't actually contain the thing you're trying to extract - it contains a bunch of JavaScript. When your browser goes to the page, it runs that JavaScript, which generates the content, so the rendered version you see in your browser is not the same thing the server returns. The page contains instructions for your browser to build the page that you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape this page you'll need to look into other solutions that can handle JavaScript-generated content:
Web-scraping JavaScript page with Python
Alternatively, the JavaScript generates its content by requesting data from other sources. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, go ahead and have BS4 prettify and print out the page that it receives. You'll see that within that page there are requests to other locations to get the info you're asking for. You might be able to change your request to go not to the page where you view the info, but to the location that page gets its data from.
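As a concrete version of that exercise, the sketch below just fetches the search page and prints the prettified HTML that requests actually receives, plus the external script URLs it references; the search URL is the one from the question, and the rest is only for inspection:
import requests
from bs4 import BeautifulSoup

url = 'https://www.carrefour.es/?q=5449000000996'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# What the server actually returned, before any JavaScript runs:
# the price elements you are looking for will not be in here.
print(soup.prettify())

# The external scripts (or the XHR requests they trigger, visible in the
# network tab of your browser's dev tools) are where the price data comes from.
for script in soup.find_all('script', src=True):
    print(script['src'])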

PdfBook Extension - MediaWiki - Fullurl parser function link automatically to every category page

I am able to display a download link on a category page to download all the pages of that category.
In the link below, it is written:
In order to include this parser function link automatically to every category page, add it to the Mediawiki:Categoryarticlecount page.
Rather than adding the download link manually to all categories, I tried the above. That is, I added the download link to the Mediawiki:Categoryarticlecount page so that the fullurl parser function link would be included automatically on every category page. But it didn't work.
Parser function link : [{{fullurl:{{FULLPAGENAME}}|action=pdfbook}} | Download]
How to achieve this?
Any help is appreciated.
You have two typos in there:
The system message is at MediaWiki:Category-article-count (note the camel-case in MediaWiki)
The external link syntax is [url text], not [url | text], so it should be [{{fullurl:{{FULLPAGENAME}}|action=pdfbook}} Download]
Other than that, your code looks fine.

HTML injection into someone else's website?

I've got a product that embeds into websites similarly to Paypal (customers add my button to their website, users click on this button and once the service is complete I redirect them back to the original website).
I'd like to demo my technology to customers without actually modifying their live website. To that end, is it possible to configure http://stackoverflow.myserver.com/ so it mirrors http://www.stackoverflow.com/ while seamlessly injecting my button?
Meaning, I want to demo the experience of using my button on the live website without actually re-hosting the customer's database on my server.
I know there are security concerns here, so feel free to mention them as long as we meet the requirements. I do not need to demo this for websites that use HTTPS.
More specifically, I would like to demonstrate the idea of financial bounties on Stackoverflow questions by injecting a Paypal button into the page. How would I demo this off http://stackoverflow.myserver.com/ without modifying https://stackoverflow.com/?
REQUEST TO REOPEN: I have reworded the question to be more specific per your request. If you still believe it is too broad, please help me understand your reasoning by posting a comment below.
UPDATE: I posted a follow-up challenge at How to rewrite URLs referenced by Javascript code?
UPDATE2: I discarded the idea of bookmarklets and Greasemonkey because they require customer-side installation/modification. We need to make the process as seamless as possible, otherwise many of them get turned off by the process and won't let us pitch.
I would suggest creating a proxy using an HTTP handler.
In ProcessRequest you can do an HttpWebRequest to get the content from the other side, alter it, and return the adjusted HTML to the browser. You can rewrite the URLs inside so that images, etc. still load from the original source.
public void ProcessRequest(HttpContext context)
{
    // get the content of the original page using HttpWebRequest
    var request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com/");
    using (var response = request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        string html = reader.ReadToEnd();
        // alter it: inject the button markup, rewrite relative URLs, etc.
        // write back the adjusted html
        context.Response.Write(html);
    }
}
If you're demoing on the client-side and looking to just hack it in quickly, you could pull it off with some jQuery. I slapped the button after the SO logo just for a demo. You could type this into your console:
$('head').append('<script src="https://www.paypalobjects.com/js/external/dg.js" type="text/javascript"></script>')
$('#hlogo').append('<form action="https://www.sandbox.paypal.com/webapps/adaptivepayment/flow/pay" target="PPDGFrame" class="standard"><label for="buy">Buy Now:</label><input type="image" id="submitBtn" value="Pay with PayPal" src="https://www.paypalobjects.com/en_US/i/btn/btn_paynowCC_LG.gif"><input id="type" type="hidden" name="expType" value="light"><input id="paykey" type="hidden" name="paykey" value="insert_pay_key">')
var embeddedPPFlow = new PAYPAL.apps.DGFlow({trigger: 'submitBtn'});
Now, I'm not sure if I did something wrong or not because I got this error on the last part:
Expected 'none' or URL but found 'alpha('. Error in parsing value for 'filter'. Declaration dropped.
But at any rate if you are demoing you could just do this, maybe as a plan B. (You could also write a userscript for this so you don't have to open the console, I guess?)
After playing with this for a very long time I ended up doing the following:
Rewrite the HTML and JS files on the fly. All other resources are hosted by the original website.
For HTML files, inject a <base> tag pointing at the original website. This causes the browser to automatically resolve relative links (in the HTML file, CSS files, and even Flash!) against the original website.
For the JS files, apply a regular expression to patch specific sections of code that point to the wrong URL. I load the redirected page in a browser, look for broken links, and figure out which section of JS needs to be patched to correct the problem.
This sounds a lot harder than it actually is. On average, patching each page takes less than 5 minutes of work.
The big discovery was the <base> tag! It corrected the vast majority of links on my behalf.
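As a rough illustration of that approach (sketched in Python here, since that's what most of this page uses), the snippet below fetches a page, injects a <base> tag pointing back at the original site, and appends a placeholder button. The '#hlogo' selector and the button markup are made up for the demo, and the regex patching of JS files is not shown:
import requests
from bs4 import BeautifulSoup

ORIGINAL = 'https://stackoverflow.com/'

def mirror(path=''):
    html = requests.get(ORIGINAL + path.lstrip('/')).text
    soup = BeautifulSoup(html, 'lxml')

    # <base> makes relative links (HTML, CSS, even Flash) resolve against
    # the original site, so assets keep loading from there.
    soup.head.insert(0, soup.new_tag('base', href=ORIGINAL))

    # Inject a demo button; the '#hlogo' id and the markup are placeholders.
    logo = soup.select_one('#hlogo')
    if logo is not None:
        button = BeautifulSoup('<button id="demo-bounty">Add bounty</button>', 'lxml').button
        logo.append(button)

    return str(soup)  # serve this from stackoverflow.myserver.com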

Get innerHTML via Jsoup

I'm trying to scrape data from this website: http://www.bundesliga.de/de/liga/tabelle/
In the source code I can see the tables, but there's no content, just things like:
<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
....
With Firebug (F12 in Firefox) I don't see any content either, but I can select the table and copy its innerHTML via the Firebug option. In that case I get all the information about the teams, but I don't know how to get the table with its content in Jsoup.
To get the value of an attribute, use the Node.attr(String key) method
For the text on an element (and its combined children), use Element.text()
For HTML, use Element.html(), or Node.outerHtml() as appropriate
For example:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link"
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml();
// "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
reference:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
The table is not rendered on the server directly, but built by the client-side JavaScript of the page and filled with data that gets to the client via AJAX. So what you get with the naive Jsoup approach is expected.
I see two possible solutions:
You analyze the network traffic and identify the AJAX calls that the site is making. Then you try to reconstruct the format and fire the same requests the JavaScript would, and rebuild the table from the responses.
You don't use Jsoup but a real browser that loads the page and runs the JavaScript, including all AJAX calls. You could use Selenium WebDriver for that. There is a headless browser called PhantomJS, with a relatively small footprint, that you can use in combination with Selenium WebDriver.
Both options have their (dis)advantages:
The first takes more time, since you need to understand the network traffic pretty well. The reward will be a very fast and memory-efficient scraper.
Programming Selenium is very easy, and you should not have any difficulty achieving your goal. You don't need to understand the inner workings of the site you want to scrape. However, the price is a further dependency in your project: memory consumption is high, another process runs, and the scraping will be slow.
Maybe you can find another source for the league table that holds the info you want? That might be the easiest option. For example http://www.fussballdaten.de/bundesliga/