Grab images from Feedzilla - HTML

I am able to get news data such as the post title and description, but how do I get the news image using the Feedzilla API?
Nothing is mentioned in the API documentation:
http://w.feedzilla.com/api-overview

When you get your list of articles (e.g. http://api.feedzilla.com/v1/categories/26/articles.json), you can see that some of them have an attribute called 'enclosures', where Feedzilla puts information about an article's images. As an example, here is an article containing an image, as JSON:
{"enclosures":[{"length":3489,"media_type":"image\/jpeg","uri":"http:\/\/news.com.au.feedsportal.com\/c\/34564\/f\/632403\/e\/1\/s\/36174fe5\/l\/0Lresources30Bnews0N0Bau0Cimages0C20A140C0A10C190C122680A50C4485310Erosetta0Bjpg\/448531-rosetta.jpg"}],"publish_date":"Sun, 19 Jan 2014 14:00:00 +0100","source":"Cnet","source_url":"http:\/\/feeds.feedburner.com\/newscomauworldnewsndm","summary":"ONE of the most ambitious missions in the history of space goes into high-risk\u000amode today when a comet-chasing probe wakes up.\u000a\u000a\u000a\u000a\u000a\u000a\u000a\u000a\u000a","title":"Comet-chasing probe to wake up (Cnet)","url":"http:\/\/news.feedzilla.com\/en_us\/stories\/top-news\/353840990?client_source=api&format=json"}
Note that this attribute is optional.
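Since the attribute is optional, client code has to guard for its absence. A minimal JavaScript sketch of pulling the image URIs out of a category listing (the 'articles' wrapper field is an assumption about the response shape; the endpoint is the one mentioned above):

const url = 'http://api.feedzilla.com/v1/categories/26/articles.json';

fetch(url)
  .then(res => res.json())
  .then(data => {
    // 'articles' wrapper is an assumed response field, as suggested by the listing URL above
    const imageUris = (data.articles || [])
      .filter(a => Array.isArray(a.enclosures))        // 'enclosures' is optional
      .flatMap(a => a.enclosures)
      .filter(e => e.media_type.startsWith('image/'))  // keep image enclosures only
      .map(e => e.uri);
    console.log(imageUris);
  });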

Scraping Comments from a Reddit Post?

I found this Reddit post: https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/
I would like to use the API in such a way that I can get all the comments from this post.
I tried looking into the documentation of this API (e.g. https://github.com/pushshift/api), and it does not seem possible. If I could somehow get the LINK_ID pertaining to this Reddit post, I think I would be able to do it.
Is this possible to do?
Thanks!
The link ID of the post is in the URL:
https://www.reddit.com/r/obama/comments/xgsxy7 <-- id
You could even open https://www.reddit.com/xgsxy7 to get the information.
If you fetch the endpoint https://www.reddit.com/xgsxy7.json, you get that same information as JSON; you can then walk the object to find the comments.
JS example:
// index 0 of the response is the post itself, index 1 is the comment tree
const data = await (await fetch('https://www.reddit.com/xgsxy7.json')).json();
const comments = data[1].data.children.map(comment => comment.data.body); // the text bodies
You can then analyze the JSON object and pull out any data you want from it: whether a comment has nested replies, its creation time, its author, etc.
I would recommend using WebScrapingAPI's extract_rules feature, which returns an array of elements matched by a CSS selector. For example, I used [data-testid='comment'] as the CSS selector in the following GET request:
https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/&render_js=1&extract_rules={"comments":{"selector":"[data-testid='comment']", "output":"text"}}
And I got:
{
  "comments": [
    "I wonder what's the most number of living ex-presidents there have been at one time?",
    "The highest number is six—occurring in four different periods in history. The most recent period was 2017-2018 before GHW Bush died.",
    "I don't understand what the first half of your title is doing there, other than to confuse and cause a person to have to read the whole title a couple of times to work out that all the living ex-presidents are invited to QEII's DC memorial service.",
    "Agreed, OP is pretty awful at writing headlines.",
    "Former disgraced president trump",
    "No, he's still disgraced.",
    "If the link is behind a paywall, or for an ad-free version: outline.com Or if you want to see the full original page: archive.org or archive.fo or 12ft.io Or Google cache: https://www.google.com/search?q=site:https://www.townandcountrymag.com/society/politics/a41245384/donald-trump-barack-obama-george-bush-queen-elizabeth-memorial/ I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns."
  ]
}

Links absent when reading HTML into R

I am attempting to get a list of all links to satellite data for a year/month on the European Space Agency's CryoSat-2 website (https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02). No matter what web-scraping or HTML-reading package I use, the links are never included. Below is an example of such an attempt with the URL provided, but it is by no means my only attempt. I am looking for an explanation of why the links that initiate the download of individual files aren't extracted, and for a way to obtain them.
library(textreadr)
html_string<- 'https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02'
html_read<- read_html(html_string)
html_read
[1] "Layer 1" "European Space Agency"
[3] "CryoSat-2 Science Server" "The access and use of CryoSat-2 products are regulated by the"
[5] "ESA's Data Policy" "and subject to the acceptance of the specific"
[7] "Terms & Conditions" "."
[9] "Users accessing CryoSat-2 products are intrinsically acknowledging and accepting the above." "Name"
[11] "Modified" "Size"
[13] "ESA European Space Agency"
OK, here is a solution. In this kind of case, where you can't get the info with regular scraping (rvest and so on), there are two options:
either you get the info with RSelenium, which can be tedious,
or you inspect the page and monitor the XHR requests (with the element inspector of Firefox, for example: Network tab, then XHR).
You will find out that the data are loaded from urls looking like:
https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02
These URLs look like they serve HTML, but if you open them you'll see they are not: they return JSON. The web page just displays the requested JSON data dynamically.
So you can simply get the info as follows:
url <- 'https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02'
library(jsonlite)
fromJSON(url)
$success
[1] TRUE
$is_writable
[1] FALSE
$results
mtime size name
1 1616543429 37713 CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2 1616543428 845594 CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3 1616543364 37713 CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4 1616543363 528578 CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5 1616543321 37713 CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6 1616543322 387650 CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7 1616543360 37713 CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8 1616543359 456414 CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9 1616543328 37713 CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10 1616543327 385998 CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc
path
1 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc
These should give you all the info you need. If you tweak the url a bit, you should be able to get all the info for the dates you want.
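For instance, here is a small sketch that builds the listing URL for any year/month and pages through it (the do=list, maxfiles, pos and file parameters come straight from the XHR URL above; stopping when a page comes back empty is an assumption):

library(jsonlite)

# Listing URL for a given year/month; parameter names taken from the XHR request above
list_url <- function(year, month, pos = 0, maxfiles = 500) {
  sprintf("https://science-pds.cryosat.esa.int/?do=list&maxfiles=%d&pos=%d&file=Cry0Sat2_data/SIR_SAR_L2/%d/%02d",
          maxfiles, pos, year, month)
}

# Page through the listing; stop when a page returns no results (assumed behaviour)
all_pages <- list()
pos <- 0
repeat {
  res <- fromJSON(list_url(2013, 2, pos = pos))
  if (is.null(res$results) || length(res$results) == 0) break
  all_pages[[length(all_pages) + 1]] <- res$results
  pos <- pos + 500
}
files <- do.call(rbind, all_pages)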
The problem seems to be that the page is dynamic. It probably has some JS code and only loads the links after it runs. So when you get the HTML from the link, you're only getting the base page (before the JS runs).
I can think of two possible solutions:
You can try to use Selenium, which emulates a user in the browser so the page loads completely, but the set-up might be a bit complicated. See this intro: https://www.r-bloggers.com/2014/12/scraping-with-selenium/
The page probably sends an HTTP request to get the links from an API; you can try to figure out the exact request. The Network tab in your browser is a good place to start.
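A minimal sketch of the Selenium route with RSelenium (assumes Firefox and a working driver install; the fixed 5-second wait is a crude stand-in for a proper wait condition):

library(RSelenium)

rd <- rsDriver(browser = "firefox", verbose = FALSE)  # starts a local Selenium server
remote <- rd$client

remote$navigate("https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02")
Sys.sleep(5)  # crude wait for the JS to render the file listing

# collect the href attribute of every link the JS inserted
anchors <- remote$findElements(using = "css selector", "a")
urls <- unlist(lapply(anchors, function(el) el$getElementAttribute("href")))

remote$close()
rd$server$stop()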

Can I make a Rainmeter widget that fetches the nearest town or city via HTML5 geolocation?

I'm trying to build a Rainmeter widget that fetches the nearest town to the user and displays it onscreen. I'm currently trying to use the WebParser plugin and this website, but it doesn't seem to be working. The code I've adapted from the example on the Rainmeter website is below. Any ideas?
[Rainmeter]
Author=Rainmeter staff
Update=1000
;[WEBSITE MEASURES]===============================
[MeasureWebsite]
Measure=Plugin
Plugin=WebParser
UpdateRate=1800
URL=http://locationdetection.mobi/
RegExp="(?siU)<span style="color:white;">town_city:</span> <b>"(.*)"</b>.*"
[MeasureTown]
Measure=Plugin
Plugin=WebParser
Url=[MeasureWebsite]
StringIndex=1
;[DISPLAY METERS]==================================
[TextStyle]
X=2
Y=17
FontFace=Segoe UI
FontSize=32
FontColor=#454442
StringStyle=Bold
Antialias=1
[MeterTown]
MeasureName=MeasureTown
Meter=String
MeterStyle=TextStyle
Y=2
I'm not sure if it's too late, but your URL doesn't seem to contain any town_city information, at least not in the source code.
Rainmeter reads the raw HTML code from the URL you give it and processes it with the provided RegExp. Without the proper information in the website, you can never get it to work.
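To illustrate the mechanism, here is a minimal sketch against a page whose raw source is known: example.com serves a static <h1>Example Domain</h1>, so the capture group actually has something to match:

[MeasureExample]
Measure=Plugin
Plugin=WebParser
UpdateRate=1800
URL=https://example.com/
; (?siU) = case-insensitive, dot matches newline, ungreedy; capture the <h1> text
RegExp="(?siU)<h1>(.*)</h1>"

[MeasureHeading]
Measure=Plugin
Plugin=WebParser
URL=[MeasureExample]
StringIndex=1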

How to display blog post in wordpress?

I have a WordPress theme, "Hatch", which I'm using for my photography. I usually build websites with HTML/CSS (in Dreamweaver), and this is my first time doing WordPress.
On my homepage you can see the recent posts as thumbnails. I'm thinking of creating a new menu item called 'Blog' that, like normal themes do, displays the blog posts. It might be something simple, but I just can't find what to code to make the posts display normally.
The website is lizettephotography.com
Thanks heaps!
Liz
No need to code.
Create a new category called "Blog" and add it to your main menu.
After that, add new posts to the Blog category. Clicking the Blog menu link will then show all posts in that category.
You just need to format the post design as you want.
EDIT:
In order to activate “post formats” in WordPress 3.1+, you will need to open your theme’s functions.php file and paste the following code:
add_theme_support( 'post-formats', array( 'aside', 'gallery' ) );
Note: aside and gallery are not the only available post formats. The available post formats are:
aside – Typically styled blog format.
chat – A chat transcript.
gallery – A gallery of images.
link – A link to another site.
image – A single image.
quote – A quotation.
status – A short status update, usually limited to 140 characters. Similar to a Twitter status update.
video – A single video.
For the full list of post formats, refer to WordPress Codex.
Once you have added this code, you will see a new field in your post write panel, in the right-hand column where you see Publish.
When writing a post, you can change the format and hit Publish. This will display your post in a pre-styled format.
Then edit your post loop.
Suppose that, in your case, the Blog category's post format is aside.
We are going to utilize the conditional tag has_post_format():
if ( has_post_format( 'aside' ) ) {
    // Blog category format
} else {
    // Normal format
}
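For context, a minimal sketch of how that conditional sits inside a standard WordPress Loop (the markup is illustrative, not taken from the Hatch theme):

<?php
if ( have_posts() ) :
    while ( have_posts() ) : the_post();
        if ( has_post_format( 'aside' ) ) {
            // aside (blog) format: content only
            the_content();
        } else {
            // normal format: title plus content
            the_title( '<h2>', '</h2>' );
            the_content();
        }
    endwhile;
endif;
?>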
I hope this will help you. More Info...

Groovy Project (html parsing, file downloading, file creating)

I am considering starting a project so that I can learn more and keep the things I have learned so far from getting rusty.
A lot of the project will be new to me, so I thought I would come here and ask for advice on what to do and how to go about doing it.
I enjoy Photoshop and toying around with it, so I thought I would mix my project with something like that. I decided my program will grab new resources for Photoshop and put them in their own folders on my computer (from deviantART for now).
For now I want to focus on a page like this:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9
I'm not fluent at reading HTML source, so it is a bit hard to see exactly what is going on.
But let's say I am on that page with the following options chosen:
Sorted by Popular
Sorted by All Time
24 Items Per Page
My goal is to individually go to each thumbnail and grab the following:
The Author
The Title
The Description
Download the File (create folder based on title name)
Download the Image (place in folder with the file above)
Create text file with the author, title, and description in it
I would like to do that for each of the 24 items on the page and then go to the next page and do the same. (I am thinking of just going through the first five pages, as I don't have much interest in trying out brushes that aren't very popular.)
So, I'm posting this for a sense of direction and perhaps some help on how to parse such a page to get what I'm looking for. I'm sure this project will keep me busy for awhile, but I'm hoping it will become useful in teaching me things.
Any help and suggestions are always appreciated.
EDIT
Each page is made up of 24 of these:
<div class="tt-a" usericon="http://a.deviantart.net/avatars/s/h/shad0w-gfx.gif" collect_rid="1:19982524">
<span class="shad0w" style="background-image: url ("http://sh.deviantart.net/shad0w/x/107/150/logo3.png");">
<a class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005" href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo-0">Shad0ws Blood Brush Set</a>
My assumption is that I want to grab all my information from the:
<a class="t" ... >
since it contains the title, the author, and the link to the page where the download URL and large image are located.
If this sounds correct, how would one go about getting that info for each object on the page (24 per page)? I would assume by using CyberNeko. I'm just not exactly sure how to get down to the level where each <a class="t"> is located.
EDIT #2
I have some test code that looks like this:
divs = []
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage("http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0")
divs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
divs.each { println it }
The XPath is correct, but it prints out:
<?xml version="1.0" encoding="UTF-8"?><a href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo=0" class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005">Shad0ws Blood Brush Set
Can you explain what I need to do to just get the href out of there? Is there a simple way to do it with HtmlUnit?
Meeting the requirements you've listed above is actually pretty easy. You can probably do it with a simple Groovy script of about 50 lines. Here's how I would go about it:
The URL of the first page is
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0
To get the next page, simply increase the value of the offset parameter by 24:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=24
So now you know how to construct the URLs for the pages you need to work with. To download the content of a page, use:
def pageUrl = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0'
// get the content as a byte array
byte[] pageContent = new URL(pageUrl).bytes
// or get the content as a String
String pageContentAsString = new URL(pageUrl).text
Now all you need to do is parse out the elements of the content that you're interested in and save them to files. For the parsing, you should use an HTML parser like CyberNeko or Jericho.
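To answer the question in EDIT #2 directly: the objects returned by getByXPath are HtmlAnchor nodes, so you can read individual attributes instead of printing the whole node. A minimal sketch built on your test code (the shortened XPath assumes every thumbnail link carries class="t", as in the markup you pasted):

import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion

def client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
def page = client.getPage('http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0')

// grab every thumbnail anchor by class instead of a brittle absolute path
page.getByXPath("//a[@class='t']").each { a ->
    def href  = a.getAttribute('href')   // deviation page holding the download link
    def title = a.getAttribute('title')  // "Title by ~Author, date"
    println "$title -> $href"
}
client.closeAllWindows()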