Why isn't this Yahoo Pipe outputting items?

I have a Yahoo! Pipe that attempts to transform an HTML page into RSS, but the resulting feed contains no items. For each entry I've parsed these elements:
link (permalink)
title (HTML title)
description (HTML entry)
guid (segment of the permalink)
Various tutorials led me to add these:
dc:creator ("Doug")
y:id.value (permalink)
y:published (with date attributes generated from text like "3 days ago")
If you edit the source and highlight Pipe Output, the debugger shows 5 entries with these elements/attributes intact.
What am I missing?

That is vexing! By tweaking it a bit to use "emit results" in the "Loop" operator box, I managed to get a feed with 5 items, but each item only contained the item.guid for some reason.
Your feed does validate at http://feedvalidator.org, though that's not hard considering it has no elements.
I tried removing some of your components but my changes did not help.
By the way, it's crazy that they're killing Yahoo 360 blogs in favor of the feedless Yahoo Profile blogs. Oh, and I like Douglas Crockford too. :-)

Related

convert docx with (ordered) list to html

I'm trying to convert a large docx document with a multi-level ordered list to HTML. (See an example of the document here: http://docdro.id/X1oyfBv; you'll need to download it.)
I tried the following things, including:
online converters such as html-cleaner and index.html (which only recognize one layer of the list)
saving as HTML, which creates a horrendous file and still doesn't recognize the ol structure
saving the file as a zip and then opening the XML inside, but I don't see an easy way to get the ol structure out of the w:... tags (see the sketch after this list)
saving it to Google Docs and running Omar Alzabir's script: http://omaralzabir.com/wp-content/uploads/2014/05/GoogleDocsEmail.jpg
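To the zip route above: a minimal Python sketch (input.docx is a hypothetical filename) that reads word/document.xml directly and prints each paragraph's text together with its list level, which is where the multi-level structure actually lives in the w:... tags:

import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# A .docx file is just a zip archive; the document body is word/document.xml.
with zipfile.ZipFile("input.docx") as z:
    root = ET.fromstring(z.read("word/document.xml"))

for p in root.iter(W + "p"):
    # Word-managed list items carry w:pPr/w:numPr/w:ilvl (the nesting level).
    ilvl = p.find("./%spPr/%snumPr/%silvl" % (W, W, W))
    text = "".join(t.text or "" for t in p.iter(W + "t"))
    level = ilvl.get(W + "val") if ilvl is not None else None
    print(level, text)

If the "list" paragraphs all print None for the level, the numbering was typed by hand rather than applied as a Word list style, which would also explain why the converters fail to produce ols.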
By the way, if I create a Word file with a multi-level ordered list and convert it, it is recognized as ols. But the existing file is not recognized as ols even if I 'un-list' and re-list it, so possibly there is something wrong with how the original document was created(?)
Any suggestions, or indications as to why this problem occurs, are much appreciated. :)
Are you asking how to save a Word-doc in HTML format, with multi-level ordered-lists?
Word-HTML has bugs in its multi-level ordered lists. For the list-items, the indentation tends to be incorrect and inconsistent. There's an example here.
Word-HTML has similar bugs in its multi-level unordered lists. An example is here.
I recently wrote a Python program that fixes these bugs in Word's HTML. The program is part of WordWebNav (WWN), which is free and open-source.
WWN is an app that converts a Microsoft-Word document to a usable web-page. It adds some missing features in the Word-HTML web-page (e.g., a navigation pane), and it fixes bugs in the Word-HTML.
You can use pandoc: https://github.com/jgm/pandoc
It's an open-source, universal command-line tool for converting between markup-based document formats.
You can use it like this:
pandoc -o output.html input.docx

extracting value from a <ul> with specific text using HTMLAgilitypack

I'm trying to extract a link from http://www.raws.dri.edu/cgi-bin/rawLIST.pl?idIAN1+id
This site contains an unordered list, and I want to get the link for Daily Summary.
So far I've tried using an XPath string of "//ul/li/a" with the .SelectNodes() method. Doing so returns only the first item in the list, which is what I want for now, but in the future I may want to get the link to a different page, so I need to be able to specify which link to retrieve.
If you use //ul/li/a, you should get all the <a> links, not just one.
If you want to extract the links that contain some text (e.g. Time Series Graph), you can do:
//ul/li/a[contains(text(), 'Time Series Graph')]
Similarly, if you're looking for some specific text in the href attribute:
//ul/li/a[contains(@href, 'Time Series Graph')]
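For context, a quick end-to-end sketch; this uses Python's lxml instead of HtmlAgilityPack, but the XPath expression is exactly the same (untested against the live page, so treat it as an illustration):

from urllib.request import urlopen
from lxml import html

# Load and parse the page, then select only the <a> elements
# whose text contains "Daily Summary".
tree = html.fromstring(urlopen("http://www.raws.dri.edu/cgi-bin/rawLIST.pl?idIAN1+id").read())
for a in tree.xpath("//ul/li/a[contains(text(), 'Daily Summary')]"):
    print(a.get("href"))

Swapping 'Daily Summary' for any other link text gives you the "specify which link" behavior you're after.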
By the way, I see you have asked many questions pointing at the same website. My suggestion: learn a little bit of XPath, just the basics, and read a tutorial about how HtmlAgilityPack works (it's pretty simple once you understand the basics of XPath), and then start working on that scraper.

Full urls of images of a given page on Wikipedia (only those I see on the page)

I'd like to extract all full URLs of the images on the "Google" page on Wikipedia.
I have tried with:
http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
but this way I also got images that are not related to Google, such as:
http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png
How can I extract only the images that I actually see on the Google page?
1. Retrieve the page source: https://en.wikipedia.org/w/index.php?title=Google&action=raw
2. Scan it for substrings like [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
3. Ask the API for all pictures on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
4. Filter out all URLs except those which match the picture names found in step 2.
Steps 2 and 4 need more explanation.
Step 2: The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regexps, \b denotes a word boundary, which might be unsupported in the language of your choice. The regexp I proposed will match all the cases that come to my mind: [[File:something.jpg]], gallery tags (<gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>), and templates ({{Infobox|pic = File:something.jpg}}). However, it won't match filenames which contain ]. I'm not sure if those are legal, but if they are, they must be very uncommon, so it should not be a big deal.
If you want to match only constructs like [[File:something.jpg|thumb|description]], the following regexp will work better: /\[\[(File|Image):[^]|]+/
Step 4: I'd remove all characters matching /[^A-Za-z0-9]/ from the names before comparing. It's easier than escaping them and, in most cases, good enough.
Icons are most often attached via templates, unlike pictures related to the article's subject, which are most often attached directly ([[File:…]]). There are exceptions, though: in some articles pictures are attached with the {{Gallery}} template, and there is also the <gallery> tag, which introduces special syntax for galleries. You'll have to tune my solution to your needs, and even then it won't be perfect, but it should be good enough.
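Putting the four steps together, a hedged Python sketch assuming the endpoints above (untested, error handling omitted):

import json
import re
from urllib.request import urlopen

# Step 1: the raw wikitext of the article.
raw = urlopen("https://en.wikipedia.org/w/index.php?title=Google&action=raw").read().decode("utf-8")

# Step 2: picture names referenced directly in the wikitext.
names = re.findall(r"\b(?:File|Image):([^]|\n\r]+)", raw)

# Step 3: every image the API reports for the page.
api = ("https://en.wikipedia.org/w/api.php?action=query&titles=Google"
       "&generator=images&gimlimit=50&prop=imageinfo&iiprop=url&format=json")
pages = json.loads(urlopen(api).read().decode("utf-8"))["query"]["pages"]

# Step 4: keep only URLs whose filename matches a step-2 name,
# comparing with /[^A-Za-z0-9]/ stripped as suggested above.
def simplify(s):
    return re.sub(r"[^A-Za-z0-9]", "", s)

wanted = set(simplify(n) for n in names)
for page in pages.values():
    for info in page.get("imageinfo", []):
        if simplify(info["url"].rsplit("/", 1)[-1]) in wanted:
            print(info["url"])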

Export Excel file to create html webpages

I'd like to know how you would try to solve the following problem:
I have an Excel spreadsheet containing "linked information" (matrix-like) about business processes, which needs to be transformed into HTML pages.
Only certain parts of the spreadsheet should be exported.
The needed data consists of a hierarchical representations of certain categories that hold different components.
So for example category 1 consists of component A which has sub-components A1, A2 and so on.
The goal is to represent that single excel spreadsheet with html websites where the main-categories lead to pages with subcategories (always listing which subcategories they hold) and so on. Kind of like a process or business flow-chart.
Whenever something gets changed, added, removed within the spreadsheet I'd like to reflect this new information with the webpages accordingly.
The important part would be not having to edit several webpages but have everything rebuild at once - with the right structure.
My first thought was to define an XSD file, extract and transform the data with XSL, and from there create the final web structure. I'm not quite sure how time-intensive this would be, or whether I could actually get a satisfying outcome.
Maybe you have a better solution for me or you can point me to some link where something similar is accomplished.
I hope I've managed to get my problem across.
Thanks for your time.
UPDATE
I made a simple version of my spreadsheet.
|*Sub*   |*Description*|*Key*|
|SubName |some text    |11   |
|SubName |some text    |11   |
|SubName |some text    |21   |
|SubName |some text    |22   |
Here the "key"-column is needed to structure the final html layout where 11 and 12 belong to an even higher category 10 which later needs to be added to the result set. What also needs to be added is a "title-category" with the highest level of 1, 2 etc.
I want to reach a point where I can create an html webpage with the title categories being listed (just like headlines) and (on the same page) in some sort of rectangle frame one can see the next level of categories (here 10 and 20) which work as a link and take one to another webpage displaying category 10 and 20 now as headlines and have the sub-categories listed and clickable to reach the final, detailed table listing. So basically it's a top-to-bottom drill down of information.
I have three Excel files with these title categories (for example: customers, orders, services).
Returning these three spreadsheets in one HTML page would be the goal, and from there one could click through to the detail pages. For now I'd be happy just to get one spreadsheet in order; a rough sketch of what I mean follows.
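Here is a rough Python sketch of the grouping step I have in mind, assuming the simplified table above is exported to a CSV file (the name processes.csv and the key-to-category rule are my own assumptions):

import csv
from collections import defaultdict
from html import escape

groups = defaultdict(list)
with open("processes.csv", newline="") as f:
    for row in csv.DictReader(f):  # columns: Sub, Description, Key
        # Assumed rule: keys 11/12 roll up to category 10, 21/22 to 20.
        category = row["Key"].strip()[0] + "0"
        groups[category].append(row)

# One page per category, listing its sub-entries.
for category, rows in groups.items():
    items = "\n".join(
        "<li>%s: %s</li>" % (escape(r["Sub"]), escape(r["Description"]))
        for r in rows
    )
    with open("category_%s.html" % category, "w") as out:
        out.write("<html><body><h1>Category %s</h1>\n<ul>\n%s\n</ul>\n</body></html>"
                  % (category, items))

An XSD/XSL pipeline would of course replace this, but it shows the structure I'm trying to reach.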
Has anyone got a good idea how I can:
a) write a schema file to produce a proper XML file,
b) and of course turn the XML file into an HTML file?
If you can point me to some examples of a similar problem, I'd be happy as well.
Thanks for your support.

Parsing a website and getting the info I need

Hi, I need to retrieve the URL of the first article for a term I search on nytimes.com.
So if I search for Apple, this link would return the results:
http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse
And you just replace Apple with the term you are searching for.
If you click on that link, you'll see that NYTimes asks you if you mean Apple Inc.
I want to get the url for this link, and go to it.
Then you will just get a lot of information on Apple Inc.
If you scroll down you will see the articles related to Apple.
So what I ultimately want is the URL of the first article on this page.
So I really do not know how to go about this. Do I use Java, or what do I use? Any help would be greatly appreciated; I would put a bounty on this later, but I need the answer ASAP.
Thanks
EDIT: Can we do this in Java?
You can use Python with the standard urllib module to fetch the pages and the great HTML parser BeautifulSoup to obtain the information you need from the pages.
From the documentation of BeautifulSoup, here's sample code that fetches a web page and extracts some info from it:
import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch and parse the page.
page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)

# Each incident sits in a <td width="90%">: location, a line break, then the description.
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
This is a nice and detailed article on the topic.
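Adapting that same pattern to your page: the selector below targets the <ul class="results"> list (the same one the C# answer further down uses), though it's untested and the NYTimes markup may have changed:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse")
soup = BeautifulSoup(page)

# The search results live in <ul class="results">; take its first link.
results = soup.find("ul", {"class": "results"})
if results is not None:
    print results.find("a")["href"]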
You certainly can do it in Java. Look at the HttpURLConnection class. Basically, you give it a URL, call the connect function, and you get back an input stream with the contents of the page, i.e. HTML text. You can then process that and parse out whatever information you want.
You're facing two challenges in the project you're describing. The first, and probably the lesser one, is figuring out the mechanics of connecting to a web page and getting hold of its text within your program. The second and probably bigger challenge is figuring out exactly how to extract the information you want from that text.
I'm not clear on the details of your requirements, but you're going to have to sort through a ton of text to find what you're looking for. Without actually looking at the NY Times site at the moment, I'm sure it has all sorts of decorations like pretty pictures, the company logo, and headlines, plus menus and advertisements and all sorts of other stuff. I sincerely doubt that the NY Times, or almost any other commercial website, returns a search page that contains nothing but a link to the article you're interested in. Somehow your program will have to figure out that the first link is to the "subscribe online" page, the second is an advertisement, the third is customer service, the fourth and fifth are more advertisements, the sixth is the home page, and so on, until it finally gets to the one you actually want. How will you identify the interesting link? There are probably headings or formatting that make it recognizable to a human being, but a human uses a lot of intuition to screen out the clutter, and that intuition is difficult to reproduce in a program.
Good luck!
You can do this in C# using the HTML Agility Pack, or using LINQ to XML if the site is valid XHTML. EDIT: It isn't valid XHTML; I checked.
The following (tested) code will get the URL of the first search result:
var doc = new HtmlWeb().Load(@"http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse");
var url = HtmlEntity.DeEntitize(doc.DocumentNode.Descendants("ul")
    .First(ul => ul.Attributes["class"] != null
              && ul.Attributes["class"].Value == "results")
    .Descendants("a")
    .First()
    .Attributes["href"].Value);
Note that if their website changes, this code might stop working.