News Aggregator of sorts - html

There is a website that my company uses that updates information about 3 specific things throughout the day. We use the information from 1 of them, and what we want to do is pull this information as it is added to their site and add it to a page of our own so it is easier to view. Is this even possible? Can anyone point me in the direction of setting this up? It is all text that we want to pull.

Pick a language (e.g. Perl). Find an HTTP library for it (e.g. LWP). Fetch the page and run it through an HTML parser (e.g. HTML::TreeBuilder). Pull out the bits you want and shove them into a template (e.g. TT), then dump the result to a file. Stick the program in cron or Windows Task Scheduler.
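As a rough illustration of those steps, here is a minimal sketch in Python rather than Perl, using only the standard library. The URL, the `class="update"` marker and the output file name are made up for the example; the real site's markup will almost certainly differ, so treat the extraction rule as an assumption to adapt.

```python
from html import escape
from html.parser import HTMLParser
from string import Template
from urllib.request import urlopen

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class UpdateExtractor(HTMLParser):
    """Collect the text of every element whose class list contains TARGET_CLASS."""

    TARGET_CLASS = "update"  # assumption: the source site marks items this way

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self._depth:
                self._buffer.append(" ")  # treat e.g. <br> as whitespace
            return
        if self._depth:
            self._depth += 1
        elif self.TARGET_CLASS in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace before storing the item text.
            self.items.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


PAGE_TEMPLATE = Template(
    "<html><body><h1>Latest updates</h1><ul>\n$rows\n</ul></body></html>\n"
)


def extract_items(html_text):
    """Parse the fetched page and return the list of extracted text items."""
    parser = UpdateExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.items


def render_page(items):
    """Fill the template with one escaped <li> per item."""
    rows = "\n".join(f"<li>{escape(item)}</li>" for item in items)
    return PAGE_TEMPLATE.substitute(rows=rows)


def main(url="http://example.com/news", out_path="updates.html"):
    # Hypothetical URL and output path; replace with your own.
    with urlopen(url) as resp:
        html_text = resp.read().decode("utf-8", errors="replace")
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(render_page(extract_items(html_text)))
```

A scheduled job (cron or Task Scheduler) would then simply call `main()` at whatever interval suits you.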

Related

Script for counting total pages on website

I am trying to write a script that will check our website every day for the total number of web pages we have. How can I do this using an API like Google Analytics? Using JSON would be nice. Here is what it might look like; maybe someone can help, please?
{
"startDate": "{date.startOfMonth.format()}",
"endDate": "{date.today}",
"dimensions": ["query","page"]
}
As nyuen has pointed out you cannot count every page in your web presence with Google Analytics. GA will only register pages that a) have GA tracking code and b) have executed this tracking code at least once in your selected timeframe. Usually that's most of the pages, but you can't be sure.
What you can do is issue a query that requests the page path dimension and at least one metric - pageviews would be the obvious choice. That's not because you actually need the number of pageviews for your purpose, but because a query without at least one metric will not work. Send the query via the API or the query explorer and then simply count the number of rows in the result set. Since the page path is unique, the number of results is the number of distinct pages with pageviews in the selected timeframe, which is the closest you will get with GA.
But there are actually tools for what you are trying to do, so you might want to start with those. For example, you might have your script make a system call (assuming a Linux system) to wget with the --spider option, which will create a list of files on a given domain. This does not require tracking code (it works by following links in the pages' source code). There is also web spider software like Screaming Frog on Windows (it doesn't really work in a script, but I guess Windows has some task scheduling tool that allows you to start programs at pre-defined times), which not only does the counting but also returns information about the health of your site (dead links etc.).
Or, since this seems to be your server, you might write a script that traverses the file system and makes a list of the files it encounters there (will not work if your pages are dynamically generated, since this counts only physical files).
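A minimal sketch of such a file system traversal in Python might look like the following. The set of extensions that count as a "page" is an assumption for illustration; adjust it to whatever your site actually serves.

```python
import os

# Assumption: which file extensions count as pages on this site.
PAGE_EXTENSIONS = {".html", ".htm", ".php"}


def list_pages(doc_root):
    """Return sorted paths (relative to doc_root) of files that look like pages."""
    pages = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in PAGE_EXTENSIONS:
                full_path = os.path.join(dirpath, name)
                pages.append(os.path.relpath(full_path, doc_root))
    return sorted(pages)
```

`len(list_pages("/var/www/html"))` (a hypothetical document root) would then give the page count, with the same caveat as above: dynamically generated pages are not seen.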
Or you write a script that parses your server logs and extracts calls to content files (this will work only for files that have actually been viewed).
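The log-parsing variant could be sketched roughly like this, assuming the server writes the usual Common/Combined Log Format. The suffix list and the choice to count only status 200 responses are assumptions, not requirements.

```python
import re

# Matches the request line and status code of a Common/Combined Log Format entry.
REQUEST_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)

# Assumption: paths ending in one of these are content pages.
PAGE_SUFFIXES = (".html", ".htm", ".php", "/")


def pages_from_log(lines):
    """Return the set of distinct page paths that were served successfully."""
    pages = set()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("status") != "200":
            continue
        path = match.group("path").split("?", 1)[0]  # drop the query string
        if path.lower().endswith(PAGE_SUFFIXES):
            pages.add(path)
    return pages
```

Passing an open log file (which iterates line by line) to `pages_from_log` and taking `len()` of the result gives the number of distinct pages that were requested at least once.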
So there are a number of better alternatives to using Google Analytics for that purpose; you might want to look into one of them first.

Design guidance for Mobile RSS Feed Reader

I am working on a mobile (Windows Phone) RSS Feed Reader pet project.
I had a few design decisions on which I needed guidance. Here they are:
Firstly, when the feed reader downloads the RSS feed, how can I show which items have been read by the user vs. those that are new or unread?
Do I store the file contents locally in a DB along with the information about which articles are read/unread?
Secondly, when we download the RSS feed, do we download the entire file? I guess even with an approach like a conditional GET to fetch feed contents only on update, there is no way to download a delta of the RSS file.
Thirdly, if the entire file is downloaded, do mobile feed readers show data from the beginning or truncate it to show items only from the last N days (where N is an integral number of days like 15 or 30)?
Thanks in advance
Regards
Vikas
Since you're building a mobile phone application, there are 2 ways you could go: either have a server keep track of the updated content and let the devices connect to it to retrieve it or handle everything (fetching, parsing and diffing the feeds) from the phones.
For your questions:
You have to keep track locally (on the device) of what the user has read or not. You'll probably use the <guid> (RSS) or <id> (Atom) elements in the feed's entries to identify each of them and keep a list of the items that were read.
Conditional GETs (If-Modified-Since and ETag) will not help you very much because when the content has been updated, they serve you the whole RSS/Atom document. So, yes, you'll have to download the whole file over and over again, and yes, that's a lot of waste.
This is really up to you. It may actually be relatively "cheap" to store everything on the device and let the user decide if they want to delete the past stories.
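To make the read/unread bookkeeping from the first point concrete, here is a minimal sketch in Python (a Windows Phone app would use its own language and storage, so this only illustrates the idea). The table name, the `ReadStore` class and the fallback to `<link>` when an RSS item has no `<guid>` are assumptions for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def entry_ids(feed_xml):
    """Return (id, title) pairs for RSS 2.0 <item> or Atom <entry> elements."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):                       # RSS 2.0
        # Assumption: fall back to <link> when an item has no <guid>.
        key = item.findtext("guid") or item.findtext("link")
        entries.append((key, item.findtext("title", default="")))
    for entry in root.iter(ATOM_NS + "entry"):           # Atom
        entries.append((entry.findtext(ATOM_NS + "id"),
                        entry.findtext(ATOM_NS + "title", default="")))
    # Drop entries that carry no usable identifier.
    return [(key.strip(), title) for key, title in entries if key]


class ReadStore:
    """Remember which entry ids the user has already read."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS read_items (entry_id TEXT PRIMARY KEY)"
        )

    def mark_read(self, entry_id):
        with self.db:  # commits on success
            self.db.execute(
                "INSERT OR IGNORE INTO read_items VALUES (?)", (entry_id,)
            )

    def unread(self, entries):
        """Filter (id, title) pairs down to those not yet marked as read."""
        seen = {row[0] for row in self.db.execute("SELECT entry_id FROM read_items")}
        return [(key, title) for key, title in entries if key not in seen]
```

Each time the whole feed is downloaded again, running it through `entry_ids` and then `unread` yields only the items the user has not opened yet.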
If you don't want to deal with the hassle of fetching, parsing and diffing RSS feeds, I suggest you check services like Superfeedr which can do that on your behalf either on the server or on the device!

Get all files in Box account

I need to fetch a list of all the files in a user's Box account, such that the list of files can then be displayed in a table view (iOS).
I have successfully implemented this by recursively using /folders/{folder id}/items on all the folders in my user's Box account.
However, while this works, it's kind of dirty, seeing as how a request is made for each of the user's folders, which could be quite a large number.
Is there any way to get a list of all the files (it's no issue if folders are included, I can ignore those manually) available?
I tried implementing this using search, but I couldn't identify a value for the query parameter that returned everything.
Any help would be appreciated.
Help me, Obi-Wan Kenobi. You're my only hope.
What you are looking for (a recursive call through a Box account) is not available. We have enterprise customers with bajillions of files and millions of folders. Recursively asking for everything would take too long.
What we generally recommend is that you ask for as little as you can, and that you use multiple threads and anticipate what you'll need just a little bit, so that you can deliver a high-performance user-interface to your end-users.
For example, ?fields=item_collection is expensive to retrieve and can add a lot to a payload. It can double, or even 10x, the time that it takes to get back a payload from the Box API. Most UIs don't need to show all the items inside every folder. So they are better off asking for ?fields=.
You can make your application responsive to the user if you make the smallest possible call. Of course there is a balance. Mobile networks have high latency, and sometimes that next API call to show some extra thing is slow. But for a folder tree, you can get high performance by retrieving only the current level, displaying that, and then starting to fetch one-level down while the user is looking at the first level.
Same goes for displaying thumbnails. If a user drills into a folder and starts looking at thumbnails for pictures, there's a good chance they'll want to see other thumbnails in that same folder. Your app should anticipate that, and start to pull one or two extras down in the background. Yes, it means more API calls, but your users will give your app a higher rating for being fast.
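The "show the current level, then start fetching one level down" pattern described for folder trees can be sketched as below. This is not Box SDK code: `fetch_items` is a hypothetical callable standing in for whatever wraps a request to /folders/{folder id}/items, and the item dictionaries with `id`, `type` and `name` keys are an assumed shape.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FolderBrowser:
    """Fetch one folder level on demand and prefetch its subfolders in the background."""

    def __init__(self, fetch_items, max_workers=4):
        # fetch_items(folder_id) -> list of dicts with "id", "type", "name".
        # Hypothetical: in a real app this would perform the API request.
        self._fetch = fetch_items
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = {}               # folder_id -> Future
        self._lock = threading.Lock()

    def _submit(self, folder_id):
        # Start a fetch unless one is already running or finished.
        with self._lock:
            if folder_id not in self._pending:
                self._pending[folder_id] = self._pool.submit(self._fetch, folder_id)
            return self._pending[folder_id]

    def open_folder(self, folder_id):
        # Blocks only if this level was not prefetched (or is still loading).
        items = self._submit(folder_id).result()
        for item in items:
            if item["type"] == "folder":
                self._submit(item["id"])  # anticipate the next drill-down
        return items

    def close(self):
        self._pool.shutdown(wait=True)
```

The sketch never expires cached results, so a real client would also need some invalidation policy; it is only meant to show where the anticipation happens.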

How do I redirect Crowdflower users to my website?

I have a specific problem: I would like to create a Crowdflower job, in which the participant will be redirected to my website (let's say http://www.xxx.yy), where he will complete the task and after he finishes, he'll be redirected back to Crowdflower and paid. Is it possible to do something like that?
I imagined they would have an API, where the user would get some token, which would be then sent to my website and after the completion of the job I could simply do some API call to mark the task as finished. However, I can't find anything in their documentation that would do such thing (http://success.crowdflower.com/customer/portal/articles/1288323).
The reason I need to redirect users to my website is that I need more freedom than CML (Crowdflower Markup Language which is used for creating tasks) offers:
I need to be able to embed an swf file
the swf file should be chosen randomly from an array of files (approx. 10)
I need to be able to measure how long s/he spends on the website and act based on that time
store some data into a database
All these things can be done pretty easily using some JavaScript and PHP, but I don't think they can be done in CML; that's why I need to redirect them to my website.
Can you please give me some advice on how to do this?

Markdown or HTML

I have a requirement for users to create, modify and delete their own articles. I plan on using the WMD editor that SO uses to create the articles.
From what I can gather SO stores the markdown and the HTML. Why does it do this - what is the benefit?
I can't decide whether to store the markdown, HTML or both. If I store both, which one do I retrieve and convert to display to the user?
UPDATE:
Ok, I think from the answers so far, I should be storing both the markdown and HTML. That seems cool. I have also been reading a blog post from Jeff regarding XSS exploits. Because the WMD editor allows you to input any HTML, this could cause me some headaches.
The blog post in question is here. I am guessing that I will have to follow the same approach as SO - and sanitize the input on the server side.
Is the sanitize code that SO uses available as Open Source or will I have to start this from scratch?
Any help would be much appreciated.
Thanks
Storing both is extremely useful in terms of performance and compatibility (and eventually also social control).
If you store only Markdown (or whatever non-HTML markup), then there's a performance cost from parsing it into HTML every time. This cost is not always negligible.
If you store only HTML, then you risk bugs silently creeping into the generated HTML. This would lead to a lot of maintenance and bugfixing headaches. You'll also lose social control because you no longer know what the user actually entered. As an admin, for example, you'd also like to know which users are trying to do XSS using <script> and so on. Also, the end user won't be able to edit the data in Markdown format. You'd need to convert it back from HTML.
To update the HTML whenever the Markdown converter version changes, you just add one extra field recording the Markdown version that was used to generate the HTML output. Whenever that version differs from the current one on the server side at the moment you retrieve the row, re-parse the data using the new version and update the row in the DB. This is only a one-time extra cost.
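A minimal sketch of that scheme, in Python with SQLite, might look like this. The table layout and function names are invented for the example, and `render_markdown` is deliberately a stand-in (it just escapes the text and wraps paragraphs), not a real Markdown converter; swap in whichever converter you actually use.

```python
import sqlite3
from html import escape

RENDERER_VERSION = 2  # bump whenever the Markdown-to-HTML converter changes


def render_markdown(source):
    """Stand-in converter (assumption for this sketch): escapes the text and
    wraps each blank-line-separated block in <p>...</p>."""
    blocks = [block.strip() for block in source.split("\n\n") if block.strip()]
    return "\n".join(f"<p>{escape(block)}</p>" for block in blocks)


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY,
        markdown TEXT NOT NULL,
        html TEXT NOT NULL,
        renderer_version INTEGER NOT NULL)""")
    return db


def save_article(db, markdown):
    """Store the raw Markdown together with the HTML rendered at post time."""
    with db:
        cur = db.execute(
            "INSERT INTO articles (markdown, html, renderer_version) VALUES (?, ?, ?)",
            (markdown, render_markdown(markdown), RENDERER_VERSION),
        )
    return cur.lastrowid


def load_html(db, article_id):
    """Return the cached HTML, re-rendering once if the converter has changed."""
    markdown, html, version = db.execute(
        "SELECT markdown, html, renderer_version FROM articles WHERE id = ?",
        (article_id,),
    ).fetchone()
    if version != RENDERER_VERSION:
        html = render_markdown(markdown)
        with db:
            db.execute(
                "UPDATE articles SET html = ?, renderer_version = ? WHERE id = ?",
                (html, RENDERER_VERSION, article_id),
            )
    return html
```

The display view reads only the `html` column in the common case, while the edit view loads the `markdown` column unchanged.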
By storing both you only have to process the markdown once (when it is posted). You would then retrieve the HTML so that you can load your pages faster.
If you only stored one, you'd forever have to recreate the other for either the display view or the edit view.