Scraping Comments from a Reddit Post?

I found this Reddit post: https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/ .
I would like to use the API to get all the comments from this post.
I looked through the API documentation (e.g. https://github.com/pushshift/api) and this does not seem possible. If I could somehow get the LINK_ID of this Reddit post, I think I would be able to do it.
Is this possible to do?
Thanks!

The link ID of the post is in the URL:
https://www.reddit.com/r/obama/comments/xgsxy7 <-- id
You can even open https://www.reddit.com/xgsxy7 to get the information.
If you fetch https://www.reddit.com/xgsxy7.json you get the same information as JSON; you then walk that object to find the comments.
JS example:
const data = fetchedJSONObject;
const comments = data[1].data.children.map(comment => comment.data.body); // to get the text body
From there you can inspect the JSON object and pull out whatever else you need: whether a comment has nested replies, when it was created, its author, etc.
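As a rough sketch of walking the nested replies mentioned above: the snippet below flattens a comment listing into one array and records each comment's depth. The function names (postIdFromUrl, flattenComments) and the sample data are made up for illustration, and the sample is a hand-written stand-in for data[1] of the fetched JSON, trimmed to the fields used here. It assumes comments have kind "t1" and that a comment without replies carries an empty string in replies.

```javascript
// Pull the post ID ("xgsxy7") out of a Reddit post URL.
function postIdFromUrl(url) {
  const match = /\/comments\/([a-z0-9]+)/i.exec(url);
  return match ? match[1] : null;
}

// Flatten a comment listing (data[1] in the answer above) into one array,
// following nested replies and recording how deep each comment sits.
function flattenComments(listing, depth = 0, out = []) {
  for (const child of listing.data.children) {
    if (child.kind !== "t1") continue; // skip "more" placeholders, which have no body
    const { author, body, created_utc, replies } = child.data;
    out.push({ author, body, created_utc, depth });
    // Assumption: `replies` is an empty string when there are no replies.
    if (replies && typeof replies === "object") {
      flattenComments(replies, depth + 1, out);
    }
  }
  return out;
}

// Hand-written sample, NOT real data from the post.
const sampleListing = {
  kind: "Listing",
  data: {
    children: [
      {
        kind: "t1",
        data: {
          author: "user_a",
          body: "Top-level comment",
          created_utc: 1663430000,
          replies: {
            kind: "Listing",
            data: {
              children: [
                {
                  kind: "t1",
                  data: { author: "user_b", body: "A reply", created_utc: 1663430100, replies: "" },
                },
              ],
            },
          },
        },
      },
      { kind: "more", data: { count: 3, children: ["abc123"] } },
    ],
  },
};

const flat = flattenComments(sampleListing);
for (const c of flat) {
  console.log(`${"  ".repeat(c.depth)}${c.author}: ${c.body}`);
}
```

With real data you would pass data[1] from the fetched JSON instead of sampleListing. Entries of kind "more" stand for comments that were not included in the response, so this sketch does not return them.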

I would recommend you use WebScrapingAPI's extract_rules feature, which returns an array of the elements matched by a CSS selector. For example, I used [data-testid='comment'] as the CSS selector in the following GET request:
https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/&render_js=1&extract_rules={"comments":{"selector":"[data-testid='comment']", "output":"text"}}
And I got:
{
"comments":[
"I wonder what's the most number of living ex-presidents there have been at one time?",
"The highest number is six—occurring in four different periods in history. The most recent period was 2017-2018 before GHW Bush died.",
"I don't understand what the first half of your title is doing there, other than to confuse and cause a person to have to read the whole title a couple of times to work out that all the living ex-presidents are invited to QEII's DC memorial service.",
"Agreed, OP is pretty awful at writing headlines.",
"Former disgraced president trump",
"No, he's still disgraced.",
"If the link is behind a paywall, or for an ad-free version:outline.comOr if you want to see the full original page:archive.org or archive.fo or 12ft.ioOr Google cache:https://www.google.com/search?q=site:https://www.townandcountrymag.com/society/politics/a41245384/donald-trump-barack-obama-george-bush-queen-elizabeth-memorial/I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns."
]
}

Related

Using R-selenium to scrape data from an aspx webpage

I am pretty new to R and Selenium, so hopefully I can express my question clearly.
I want to scrape some data off a website (.aspx), and I need to type in a chemical code to pull up some information on the next page (using RSelenium to fill in the input and click the element). So far I have been able to write a short script that gets me through the first step, i.e. it brings up the page I want. But I have had a lot of trouble finding a good way to scrape the data (the chemical information in the table) off this website, mainly because the site does not assign a new URL to the results: it gives me the same .aspx address for any chemical I search for. I plan to overcome this and then build a loop so I can scrape more information automatically. Does anyone have any thoughts on how I should get the data after clickElement? I need the chemical information table on the second page.
Thanks heaps in advance!
Here is the code I have written so far; the next step is to scrape the table from the next page.
library("RSelenium")
checkForServer()
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://limitvalue.ifa.dguv.de/")
mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox <- mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox$sendKeysToElement(list("64-19-7"))
wxbutton <- mybrowser$findElement(using = 'css selector', "#Butsearch")
wxbutton$clickElement()

First of all, your tool choice is wrong.
Secondly, in your case the request flow is:
POST to the "permanent" URL
302 redirect to a new URL, which is http://limitvalue.ifa.dguv.de/WebForm_ueliste2.aspx in your case
GET the new URL
Thirdly, what is the ultimate output you are after?
Whether automating this is worthwhile depends on how much data you need; if it is only a little, do it manually.
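The POST, redirect, GET flow above could be sketched without a browser roughly as follows. This is a minimal sketch under several assumptions, none confirmed by the site:

- The form field names match the element ids from the question (Tbox_cas, Butsearch), and the button value is "Search".
- The page's hidden inputs (ASP.NET pages usually carry ones such as __VIEWSTATE) have to be sent back with the POST.
- The server does not need a session cookie between requests.

hiddenFields, buildSearchBody and searchChemical are illustrative names. The regex-based input parsing is a shortcut, and a real HTML parser would be more robust.

```javascript
// Collect the hidden <input> fields of a form page so they can be posted back.
function hiddenFields(html) {
  const fields = {};
  for (const [tag] of html.matchAll(/<input\b[^>]*>/gi)) {
    if (!/type\s*=\s*["']hidden["']/i.test(tag)) continue;
    const name = /name\s*=\s*["']([^"']*)["']/i.exec(tag);
    const value = /value\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : "";
  }
  return fields;
}

// Build the URL-encoded POST body: hidden fields plus the search input.
function buildSearchBody(formHtml, casNumber) {
  const body = new URLSearchParams(hiddenFields(formHtml));
  body.set("Tbox_cas", casNumber); // assumed field name (taken from the element id)
  body.set("Butsearch", "Search"); // assumed field name and value
  return body.toString();
}

// POST the form; fetch follows the 302 and issues the GET for the new URL itself.
// Needs a runtime with a global fetch (for example Node 18+). Not called here.
async function searchChemical(casNumber) {
  const formUrl = "http://limitvalue.ifa.dguv.de/";
  const formHtml = await (await fetch(formUrl)).text();
  const response = await fetch(formUrl, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: buildSearchBody(formHtml, casNumber),
  });
  return response.text(); // HTML of the results page, if the assumptions hold
}

// Offline demonstration on a made-up form snippet.
const demoHtml =
  '<input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />' +
  '<input type="text" name="Tbox_cas" id="Tbox_cas" />';
console.log(buildSearchBody(demoHtml, "64-19-7"));
```

If the site does rely on a session cookie, the cookie from the first response would have to be sent along with the POST and the follow-up GET, which this sketch does not do.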

Can I make a Rainmeter widget that fetches the nearest/town or city via HTML5 geolocation?

I'm trying to build a Rainmeter widget that fetches the nearest town to the user and displays it onscreen. I'm currently trying to use the webparser function and this website, but it doesn't seem to be working. The code I've adapted from the example on the Rainmeter website is below - any ideas?
[Rainmeter]
Author=Rainmeter staff
Update=1000
;[WEBSITE MEASURES]===============================
[MeasureWebsite]
Measure=Plugin
Plugin=WebParser
UpdateRate=1800
URL=http://locationdetection.mobi/
RegExp="(?siU)<span style="color:white;">town_city:</span> <b>"(.*)"</b>.*"
[MeasureTown]
Measure=Plugin
Plugin=WebParser
Url=[MeasureWebsite]
StringIndex=1
;[DISPLAY METERS]==================================
[TextStyle]
X=2
Y=17
FontFace=Segoe UI
FontSize=32
FontColor=#454442
StringStyle=Bold
Antialias=1
[MeterTown]
MeasureName=MeasureTown
Meter=String
MeterStyle=TextStyle
Y=2

I'm not sure if it's too late but... your URL doesn't seem to contain any town_city information... at least not in the source code.
Rainmeter reads the raw HTML code from the URL you give to it and processes it through the provided RegEx. Without proper information in the website you can never get it to work.

Grab images from feedzilla

I am able to get news data such as the post title and description, but how do I get the news image using the feedzilla API?
Nothing about this is mentioned in the API documentation:
http://w.feedzilla.com/api-overview

When you get your list of articles (e.g. http://api.feedzilla.com/v1/categories/26/articles.json) you can see that some of them have an attribute called 'enclosures', where feedzilla puts information about possible images for an article. As an example, here is an article containing an image, as JSON:
{"enclosures":[{"length":3489,"media_type":"image\/jpeg","uri":"http:\/\/news.com.au.feedsportal.com\/c\/34564\/f\/632403\/e\/1\/s\/36174fe5\/l\/0Lresources30Bnews0N0Bau0Cimages0C20A140C0A10C190C122680A50C4485310Erosetta0Bjpg\/448531-rosetta.jpg"}],"publish_date":"Sun, 19 Jan 2014 14:00:00 +0100","source":"Cnet","source_url":"http:\/\/feeds.feedburner.com\/newscomauworldnewsndm","summary":"ONE of the most ambitious missions in the history of space goes into high-risk\u000amode today when a comet-chasing probe wakes up.\u000a\u000a\u000a\u000a\u000a\u000a\u000a\u000a\u000a","title":"Comet-chasing probe to wake up (Cnet)","url":"http:\/\/news.feedzilla.com\/en_us\/stories\/top-news\/353840990?client_source=api&format=json"}
Note that this attribute is optional.
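A small sketch of reading that optional attribute: the helper below returns the image URIs of one article object and copes with articles that have no enclosures at all. imageUrls is an illustrative name, and the sample articles are hand-written with a placeholder URI rather than taken from the API.

```javascript
// Return the URIs of all image enclosures of an article (empty array if none).
function imageUrls(article) {
  const enclosures = Array.isArray(article.enclosures) ? article.enclosures : [];
  return enclosures
    .filter((e) => typeof e.media_type === "string" && e.media_type.startsWith("image/"))
    .map((e) => e.uri);
}

// Hand-written sample articles; the URI is a placeholder.
const sampleArticles = [
  {
    title: "Comet-chasing probe to wake up (Cnet)",
    enclosures: [
      { length: 3489, media_type: "image/jpeg", uri: "http://example.com/448531-rosetta.jpg" },
    ],
  },
  { title: "Article without an image" },
];

for (const article of sampleArticles) {
  const [firstImage] = imageUrls(article);
  console.log(`${article.title} -> ${firstImage ?? "(no image)"}`);
}
```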

Groovy Project (html parsing, file downloading, file creating)

I am considering starting a project so that I can learn more and keep the things I have learned thus far from getting rusty.
A lot of the project will be new things so I thought I would come here and ask for advice on what to do and how to go about doing it.
I enjoy Photoshop and toying around with it, so I thought I would mix my project with something like that. So I decided my program will do something along the lines of grabbing new resources for Photoshop and putting them in their own folder on my computer (from deviantart for now).
For now I want to focus on a page like this:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9
I'm not fluent in HTML, so it is a bit hard to see exactly what is going on in the source.
But let's say I am on that page and I have the following options chosen:
Sorted by Popular
Sorted by All Time
Sorted by 24 Items Per Page
My goal is to individually go to each thumbnail and grab the following:
The Author
The Title
The Description
Download the File (create folder based on title name)
Download the Image (place in folder with the file above)
Create text file with the author, title, and description in it
I would like to do that for each of the 24 items on the page and then go to the next page and do the same. (I am thinking of just going through the first five pages as I don't have too much interest in trying out brushes that aren't too popular)
So, I'm posting this for a sense of direction and perhaps some help on how to parse such a page to get what I'm looking for. I'm sure this project will keep me busy for awhile, but I'm hoping it will become useful in teaching me things.
Any help and suggestions are always appreciated.
.
.
EDIT
Each page is made up of 24 of these:
<div class="tt-a" usericon="http://a.deviantart.net/avatars/s/h/shad0w-gfx.gif" collect_rid="1:19982524">
<span class="shad0w" style="background-image: url ("http://sh.deviantart.net/shad0w/x/107/150/logo3.png");">
<a class="t" title="Shad0ws Blood Brush Set by ~Shad0w-GFX, Jun 28, 2005" href="http://Shad0w-GFX.deviantart.com/art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplications%2Fpsbrushes&qo-0">Shad0ws Blood Brush Set</a>
My assumption is, I want to grab all my information from the:
<a class="t" ... >
Since it contains the title, author, and link to where the download url and large image is located.
If this sounds correct, how would one go about getting that info for each object on the page (24 per page)? I would assume by using CyberNeko. I'm just not exactly sure how to get to the level where the <a class="t"> is located, for each of them on the page.
.
.
EDIT #2
I have some test code that looks like this:
divs = []
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage("http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0")
divs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
divs.each { println it }
The XPath is correct, but it prints out:
<?xml version="1.0" encoding="UTF-8"?><a href="http://Shad0w-GFX.deviantart.com/
art/Shad0ws-Blood-Brush-Set-19982524?q=boost%3Apopular+in%3Aresources%2Fapplicat
ions%2Fpsbrushes&qo=0" class="t" title="Shad0ws Blood Brush Set by ~Shad0w-G
FX, Jun 28, 2005">Shad0ws Blood Brush Set
Can you explain what I need to do to just get the href out of there? Is there a simple way to do it with HtmlUnit?

Meeting the requirements you've listed above is actually pretty easy. You can probably do it with a simple Groovy script of about 50 lines. Here's how I would go about it:
The URL of the first page is
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0
To get the next page, simply increase the value of the offset parameter by 24:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=24
So now you know how to construct the URLs for the pages you need to work with. To download the content of this page use:
def pageUrl = 'http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0'
// get the content as a byte array
byte[] pageContent = new URL(pageUrl).bytes
// or get the content as a String
String pageContentAsString = new URL(pageUrl).text
Now all you need to do is parse out the elements of the content that you're interested in and save them to files. For the parsing, you should use an HTML parser like CyberNeko or Jericho.
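The offset arithmetic above can be sketched as follows, shown in JavaScript rather than Groovy. pageUrls and folderName are illustrative names. The rule folderName uses to turn a title into a directory name (replace anything other than letters, digits, underscore, space, dot and hyphen) is an assumption, not something the question or answer specifies.

```javascript
const BASE_URL = "http://browse.deviantart.com/resources/applications/psbrushes/";
const ITEMS_PER_PAGE = 24;

// URLs of the first `pageCount` result pages: offset 0, 24, 48, ...
function pageUrls(pageCount) {
  return Array.from(
    { length: pageCount },
    (_, page) => `${BASE_URL}?order=9&offset=${page * ITEMS_PER_PAGE}`
  );
}

// Turn a title into something usable as a folder name (assumed rule, see above).
function folderName(title) {
  return title.trim().replace(/[^\w .-]+/g, "_");
}

for (const url of pageUrls(5)) {
  console.log(url);
}
console.log(folderName("Shad0ws Blood Brush Set"));
```

Five pages at 24 items each cover offsets 0 through 96, which matches the "first five pages" goal in the question.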

Filter out @replies in a Twitter feed?

I have a feed from my Twitter profile on the top of my site but I wondered if there is a way to filter out my @replies and only show my status updates?
Thanks

Maybe with Yahoo Pipes.
Tomalak has made a quick example for you.

If you're using the standard Twitter feed web code for Blogger and similar sites, this bit of Javascript does the trick. It sits between the Twitter feed and the callback and strips replies out of the server response.
For a blog badge, the standard Twitter web code ends with two <script> tags. The first provides the function that displays your tweets. The second queries twitter for the tweets to display.
Add this script to your badge code before the Twitter query. It provides a new function called filterCallback which strips @replies from the Twitter response.
<script type="text/javascript">
function filterCallback( twitter_json ) {
  var result = [];
  for(var index in twitter_json) {
    if(twitter_json[index].in_reply_to_user_id == null) {
      result[result.length] = twitter_json[index];
    }
    if( result.length==5 ) break; // Edit this to change the maximum tweets shown
  }
  twitterCallback2(result); // Pass tweets onto the original callback. Don't change it!
}
</script>
The Twitter query itself has a parameter that specifies which function to call when the response comes back. In Blogger's case, that function is called 'twitterCallback2'; you can search for it in the web code (look for callback=twitterCallback2). To use the new filter, replace the text twitterCallback2 in that parameter with filterCallback. The filter is hard-coded to call twitterCallback2 when it's done.
Note that this reduces the number of displayed tweets if some of the responses from Twitter are replies, so you have to increase the count parameter in the call to allow for that. The new function then limits the number of displayed tweets to five; edit the code to change that.
Here's my blog post about it: Filter Replies out of Twitter Feed

If you want to use the new Twitter widgets, just add this piece of code within the features: setting of the widget's source code:
filters: {
negatives: /\B@\w{1,20}(\s+|$)/
},
I took this one from Dustin Diaz's website at http://www.dustindiaz.com. Dustin Diaz is the creator of the Twitter widget.

Change the setUser call to
setUser('name&exclude_replies=true');
This is kind of a hack but it does the trick

Depends on what you're using to display the entries. If you're using Twitter's widget, then probably not. If you're using some other programmatic way of displaying the items, you'd need to provide more details about what you're doing (language, sample code, etc) and we can probably help with filtering.

You'll probably want to use a regular expression, something along the lines of:
[a-zA-Z0-9][a-zA-Z0-9]*: @[a-zA-Z0-9][a-zA-Z0-9]*.*
The exact pattern depends on how you format your Twitter feed on your page. This regex assumes each entry is formatted something like:
username: @username msg txt
If it matches, don't display it. If it doesn't match, then display it. :) If you've got tags in there along with the text, adjust the regex appropriately.
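A minimal sketch of that match-and-skip idea, assuming each feed entry is a plain string of the form "username: text" and that a reply starts with an @username right after the colon. The pattern is the one above written with +, plus a leading ^ so that it only matches at the start of an entry. That anchor is an addition here, and withoutReplies and the sample entries are illustrative.

```javascript
// Matches "username: @otheruser ..." at the start of an entry.
const replyPattern = /^[a-zA-Z0-9]+: @[a-zA-Z0-9]+.*/;

// Keep only the entries that are not replies.
function withoutReplies(entries) {
  return entries.filter((entry) => !replyPattern.test(entry));
}

// Hand-written sample entries.
const entries = [
  "alice: @bob thanks for the tip",
  "alice: shipping the new release today",
  "alice: had coffee with @bob this morning",
];

console.log(withoutReplies(entries));
```

Twitter usernames may also contain underscores, so a real filter would probably need _ added to both character classes.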
