Groovy Project (HTML parsing, file downloading, file creating)

I am considering starting a project so that I can learn more and keep the things I have learned so far from getting rusty.
A lot of the project will be new to me, so I thought I would come here and ask for advice on what to do and how to go about doing it.
I enjoy Photoshop and toying around with it, so I thought I would mix my project with something like that. I decided my program will grab new resources for Photoshop and put them in their own folders on my computer (from deviantART, for now).
For now I want to focus on a page like this:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9
I'm not fluent in HTML, so it is a bit hard to see exactly what is going on in the page source.
But let's say I am on that page and I have the following options chosen:
Sorted by Popular
Sorted by All Time
24 Items Per Page
My goal is to individually go to each thumbnail and grab the following:
The Author
The Title
The Description
Download the File (create folder based on title name)
Download the Image (place in folder with the file above)
Create text file with the author, title, and description in it
I would like to do that for each of the 24 items on the page and then go to the next page and do the same. (I am thinking of just going through the first five pages as I don't have too much interest in trying out brushes that aren't too popular)
So, I'm posting this for a sense of direction and perhaps some help on how to parse such a page to get what I'm looking for. I'm sure this project will keep me busy for a while, but I'm hoping it will become useful in teaching me things.
Any help and suggestions are always appreciated.
EDIT
Each page is made up of 24 of these:
<div class="tt-a" usericon="http://a.deviantart.net/avatars/s/h/shad0w-gfx.gif" collect_rid="1:19982524">
<span class="shad0w" style="background-image: url ("http://sh.deviantart.net/shad0w/x/107/150/logo3.png");">
<a class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005" href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo-0">Shad0ws Blood Brush Set</a>
My assumption is, I want to grab all my information from the:
<a class="t" ... >
Since it contains the title, author, and link to where the download url and large image is located.
If this sounds correct, how would one go about getting that info for each object on the page (24 per page)? I would assume by using CyberNeko. I'm just not exactly sure how to get to the proper level where the <a class="t"> is located, and how to do that for each of them on the page.
EDIT #2
I have some test code that looks like this:
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage("http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0")
// note @class, not #class, in the attribute predicate
divs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
divs.each { println it }
The XPath is correct, but it prints out:
<?xml version="1.0" encoding="UTF-8"?><a href="http://Shad0w-GFX.deviantart.com/
art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplicat
ions%2Fpsbrushes&qo=0" class="t" title="Shad0ws Blood Brush Set by ~Shad0w-G
FX, Jun 28, 2005">Shad0ws Blood Brush Set
Can you explain what I need to do to just get the href out of there? Is there a simple way to do it with HtmlUnit?

Meeting the requirements you've listed above is actually pretty easy. You can probably do it with a simple Groovy script of about 50 lines. Here's how I would go about it:
The URL of the first page is
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0
To get the next page, simply increase the value of the offset parameter by 24:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=24
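As a quick Groovy sketch, the five page URLs you plan to visit (offsets 0, 24, 48, 72, 96) can be generated like this:
def base = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset='
// offsets 0, 24, 48, 72, 96 give the first five pages
def pageUrls = (0..4).collect { base + (it * 24) }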
So now you know how to construct the URLs for the pages you need to work with. To download the content of a page, use:
def pageUrl = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0'
// get the content as a byte array
byte[] pageContent = new URL(pageUrl).bytes
// or get the content as a String
String pageContentAsString = new URL(pageUrl).text
Now all you need to do is parse out the elements of the content that you're interested in and save them to files. For the parsing, you should use an HTML parser like CyberNeko or Jericho.
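For example, here is a minimal sketch of the parsing step, feeding CyberNeko's SAX parser into Groovy's XmlSlurper. The folder layout and the author-parsing regex are my assumptions, and the deviation page behind each link still has to be fetched separately for the real download URL and description:
import org.cyberneko.html.parsers.SAXParser

def pageUrl = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0'
// NekoHTML normalizes element names to upper case by default
def doc = new XmlSlurper(new SAXParser()).parse(pageUrl)
def anchors = doc.'**'.findAll { it.name() == 'A' && it.@class.text() == 't' }
anchors.each { a ->
    def title = a.text()
    def href = a.@href.text()
    // the title attribute looks like "Some Title by ~Author, Jun 28, 2005"
    def author = (a.@title.text() =~ /by ~([^,]+)/)[0][1]
    def dir = new File(title.replaceAll(/\W+/, '_'))   // assumed folder-naming scheme
    dir.mkdirs()
    new File(dir, 'info.txt').text = "Author: $author\nTitle: $title\nLink: $href\n"
}
And to answer your EDIT #2 directly: if you stay with HtmlUnit, each object your XPath returns is an HtmlAnchor, so it.getHrefAttribute() (or just it.hrefAttribute in Groovy) gives you only the href instead of the whole element.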

Related

Scraping prices with BeautifulSoup4 in Python3

I am new to scraping with Python and BeautifulSoup4, and I do not have much knowledge of HTML. To practice, I am trying to use them on the Carrefour website to extract the price and price per kilogram of a product that I search for by EAN code.
My code:
import requests
from bs4 import BeautifulSoup

barcodes = ['5449000000996']
for barcode in barcodes:
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class': 'ebx-result-price__value'})
    print(searchingprice)
    searchingpricerperkg = bs.find_all('span', {'class': 'ebx-result__quantity ebx-result-quantity'})
    print(searchingpricerperkg)
But I do not get any results at all.
Here is a screenshot of the HTML code:
What am I doing wrong? I tried it with another website and it seems to work.
The problem here is that you're scraping a page with JavaScript-generated content. Basically, the page that you're grabbing with requests doesn't actually have the thing you're grabbing from it - it has a bunch of JavaScript. When your browser goes to the page, it runs that JavaScript, which generates the content, so the page you see rendered in your browser is not the same thing returned by the actual page itself. The page contains instructions for your browser to write the page that you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape from this page, you'll need to look into other solutions that can handle javascript generated content:
Web-scraping JavaScript page with Python
Alternatively, the JavaScript generates content by requesting data from other sources. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, go ahead and have BS4 prettify and print out the page that it receives. You'll see that within that page there are requests to other locations to get the info you're asking for. You might be able to change your request to go not to the page where you view the info, but to the location that page gets its data from.
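If you do want to keep working against this page from the JVM side - sketched here in Groovy with HtmlUnit, the same library used earlier on this page - you can let the client execute the JavaScript before querying the DOM. This is only a sketch: HtmlUnit's JavaScript engine may or may not cope with this particular site, and the class name is copied from your find_all call:
import com.gargoylesoftware.htmlunit.WebClient

def client = new WebClient()
client.javaScriptEnabled = true             // run the page's scripts
def page = client.getPage('https://www.carrefour.es/?q=5449000000996')
client.waitForBackgroundJavaScript(10000)   // give async requests time to finish
// query the rendered DOM rather than the raw HTML
page.getByXPath("//strong[@class='ebx-result-price__value']").each {
    println it.textContent
}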

Using R-selenium to scrape data from an aspx webpage

I am pretty new to R and Selenium, so hopefully I can express my question clearly.
I want to scrape some data off a website (.aspx), and I need to type a chemical code to pull up the information on the next page (using RSelenium to input text and click elements). So far I have been able to build a short script that gets me through the first step, i.e. it pulls up the correct page. But I have had a lot of trouble finding a good way to scrape the data (the chemical information in the table) off this website, mainly because the website will not assign a new address for each chemical I search - it gives me the same .aspx address. I plan to overcome this and then build a loop so I can scrape more information automatically. Does anyone have any good thoughts on how I should get the data off after the click? I need the chemical information table on the second page.
Thanks heaps in advance!
Here is the code I have written so far; the next step I need is to scrape the table out of the next page!
library("RSelenium")
checkForServer()
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://limitvalue.ifa.dguv.de/")
wxbox <- mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox$sendKeysToElement(list("64-19-7"))
wxbutton <- mybrowser$findElement(using = 'css selector', "#Butsearch")
wxbutton$clickElement()
First of all, your tool choice is heavier than it needs to be: this site answers plain HTTP requests, so you don't need to drive a browser for it.
Secondly, in your case the flow is:
POST to the "permanent" URL
302 redirect to a new URL, which is http://limitvalue.ifa.dguv.de/WebForm_ueliste2.aspx in your case
GET the new URL
Thirdly, what's the ultimate output you are after? Whether automating this is worth it depends on how much data you are after; for a handful of chemicals, doing it manually may be quicker.
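Sketched in Groovy (the main language elsewhere on this page) rather than R, that POST / 302 / GET flow looks roughly like this. The form field names are guesses based on your CSS selectors, and a real ASPX form will usually also expect its __VIEWSTATE / __EVENTVALIDATION hidden fields to be echoed back from an initial GET:
// step 1: POST the search to the "permanent" URL
def post = new URL('http://limitvalue.ifa.dguv.de/').openConnection()
post.requestMethod = 'POST'
post.doOutput = true
post.instanceFollowRedirects = false
post.setRequestProperty('Content-Type', 'application/x-www-form-urlencoded')
post.outputStream.withWriter { it << 'Tbox_cas=64-19-7&Butsearch=Search' }  // guessed field names
// step 2: the server answers with a 302 and a Location header
def location = post.getHeaderField('Location')
// step 3: GET the redirect target, which contains the table
def tablePage = new URL(new URL('http://limitvalue.ifa.dguv.de/'), location).text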

retrieving URLs from functions within HTML (python)

I need to scrape some URLs from some retailer product pages, but the specific URLs I need to get aren't in the html part of the page. The html looks like this for each of the items on which one would click to get to the page with the URL I need to grab:
<div id="name" class="hand bold" onclick="AVON.productcontrol.Go(45714);">ADVANCE TECHNIQUES Color Protection Conditioner Bonus Size</div>
I wrote the following to get URLs from the page, but since the actual URLs I need don’t seem to be stored in the page, it doesn’t get what I need:
import urllib
import lxml.html
from lxml.cssselect import CSSSelector

def getUrls(URL):
    """input: product page url
    output: list of urls to products
    """
    connection = urllib.urlopen(URL)
    dom = lxml.html.fromstring(connection.read())
    selAnchor = CSSSelector('a')
    foundElements = selAnchor(dom)
    urlList = [e.get('href') for e in foundElements]
    return urlList
Is there a way to get the link that the function after ‘onclick’ (I guess AVON.productcontrol.Go(#);) takes you to? I don’t fully understand html, and while I’ve read a bit about onclick, I can’t figure out how the function after 'onclick' works.
In order to find the URL that you are taken to on click, you need to find the JavaScript source code of the 'Go' function and read and understand it. It's buried somewhere within a <script> tag or in a JavaScript .js file that is referenced directly or indirectly by the HTML page. Happy digging!
Or: you automate the interaction with the web page with a tool like Selenium (http://docs.seleniumhq.org/) and just check where it takes you when you click.
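If you'd rather not drive a browser, you can at least pull the numeric ids out of the onclick handlers and then plug them into whatever URL pattern the Go function builds. A sketch in Groovy (the main language elsewhere on this page); the listing URL here is hypothetical, and the final URL pattern is still yours to discover from the Go source:
def productPageUrl = 'http://example.com/product-listing'  // hypothetical
def html = new URL(productPageUrl).text
// collect every numeric id passed to AVON.productcontrol.Go(...)
def ids = (html =~ /AVON\.productcontrol\.Go\((\d+)\)/).collect { it[1] }
ids.each { println it }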

Can I make a Rainmeter widget that fetches the nearest/town or city via HTML5 geolocation?

I'm trying to build a Rainmeter widget that fetches the nearest town to the user and displays it onscreen. I'm currently trying to use the webparser function and this website, but it doesn't seem to be working. The code I've adapted from the example on the Rainmeter website is below - any ideas?
[Rainmeter]
Author=Rainmeter staff
Update=1000
;[WEBSITE MEASURES]===============================
[MeasureWebsite]
Measure=Plugin
Plugin=WebParser
UpdateRate=1800
URL=http://locationdetection.mobi/
RegExp="(?siU)<span style="color:white;">town_city:</span> <b>"(.*)"</b>.*"
[MeasureTown]
Measure=Plugin
Plugin=WebParser
Url=[MeasureWebsite]
StringIndex=1
;[DISPLAY METERS]==================================
[TextStyle]
X=2
Y=17
FontFace=Segoe UI
FontSize=32
FontColor=#454442
StringStyle=Bold
Antialias=1
[MeterTown]
MeasureName=MeasureTown
Meter=String
MeterStyle=TextStyle
Y=2
I'm not sure if it's too late but... your URL doesn't seem to contain any town_city information... at least not in the source code.
Rainmeter reads the raw HTML code from the URL you give it and runs it through the provided RegExp. If the information isn't in the page source, you can never get it to work.

Is there an 'easy' way to split an enormous HTML file into multiple pages

I was given an HTML log file that has ~306K lines. I know there are better formats for this, but I would like to be able to view the file online. I think breaking the file up into smaller pieces and "paging" is probably the way to go. Is there any way other than manually doing the following:
take the header of the initial page and copy it to every new file
copy 5-10k lines of the initial file and paste them into a new body
copy the footer of the initial file to every new file
Then use basic naming conventions of 1.html, 2.html, 3.html and create sublinks on every page in the new footer. Is there an automated way to do this?
You can use something like this:
var text = document.getElementById('text').value;
var pieces = [];
var total = Math.ceil(text.length / 10000);
for (var i = 0; i < total; i++) {
    pieces[i] = text.substr(i * 10000, 10000);
}
then send each piece to a file (however you want - ajax, or some server-side write)
if your HTML file is properly formatted (i.e. with line breaks), you can take out the header and footer and split the content every, let's say, 1000 lines (or any amount that guarantees meaningful data). That sounds doable with a shell script.
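For example, here's a Groovy take on that shell-script idea, assuming you know how many lines the header and footer span (the 50/20 counts, the chunk size, and the file names below are all hypothetical):
def lines = new File('log.html').readLines()
def header = lines[0..<50]                  // hypothetical header size
def footer = lines[-20..-1]                 // hypothetical footer size
def body = lines[50..<(lines.size() - 20)]
// write 5000-line chunks, each re-wrapped with the original header and footer
body.collate(5000).eachWithIndex { chunk, i ->
    new File("${i + 1}.html").text = (header + chunk + footer).join('\n')
}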