Parsing HTML into JSON - html

I've been tasked with getting all the SMS updates from this page and putting them into a JSON feed using Yahoo Pipes. I'm not entirely sure how I would get each update, as they are not individual elements, but just a collection of title, etc. Any shared wisdom would be much appreciated!

<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 2nd January 2010 at 01:12</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Pokhara - Nepal)</div>
<p class="blogpostmessage">
RACE DAY! We took the extra day off to pimp the rick some more, including a huge Australian flag. Quiet night at a pub with 6 other teams. Time for brekkie and then we're off to the rickshaw grounds for 8:30 for 10am start.
</p>
That seems a fairly easy job for a DOM/XML parser.
Since the blocks are not enclosed in XML tags you could look for elements that are present in each block, for example the <h1 id="blogtitle">SMS Update</h1> defines the start of a new block.
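Use your DOM parser to look for all the elements with id blogtitle. From each one, use the DOM property nextSibling to step to the elements that follow it. All you need is the 3 element siblings after the blogtitle element: the time, the category and the message. Note that nextSibling can also return whitespace text nodes, so skip those.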
With a little work you can easily use this logic to build your JSON object.
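A minimal sketch of that logic in Python, run outside Yahoo Pipes. It swaps the DOM and nextSibling for the standard library's event-based html.parser, but follows the same idea: every <h1 id="blogtitle"> opens a new record, and the elements after it fill that record. The JSON key names (title, time, category, message), the class SmsUpdateParser and the function updates_to_json are assumptions made for illustration, and the sample message is shortened.

```python
import json
from html.parser import HTMLParser

# Map class names from the question's HTML to JSON keys (key names are assumptions).
FIELDS = {
    "blogposttime": "time",
    "blogcategories": "category",
    "blogpostmessage": "message",
}

SAMPLE = """
<h1 id="blogtitle">SMS Update</h1>
<div class="blogposttime blogdetail">Left at 2nd January 2010 at 01:12</div>
<div class="blogcategories blogdetail">Recieved by SMS (Location: Pokhara - Nepal)</div>
<p class="blogpostmessage">
RACE DAY! We took the extra day off to pimp the rick some more.
</p>
"""


class SmsUpdateParser(HTMLParser):
    """Groups the elements that follow each <h1 id="blogtitle"> into one record."""

    def __init__(self):
        super().__init__()
        self.updates = []   # one dict per update, in document order
        self._field = None  # key currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("id") == "blogtitle":
            # A title marks the start of a new update block.
            self.updates.append({"title": ""})
            self._field = "title"
            return
        if not self.updates:
            return  # ignore anything before the first title
        for cls in (attrs.get("class") or "").split():
            if cls in FIELDS:
                self._field = FIELDS[cls]
                self.updates[-1][self._field] = ""
                break

    def handle_endtag(self, tag):
        if tag in ("h1", "div", "p"):
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.updates[-1][self._field] += data


def updates_to_json(html):
    """Return a JSON array with one object per SMS update found in html."""
    parser = SmsUpdateParser()
    parser.feed(html)
    parser.close()
    # Trim the line breaks that surround the text inside each element.
    cleaned = [{k: v.strip() for k, v in u.items()} for u in parser.updates]
    return json.dumps(cleaned, indent=2)


if __name__ == "__main__":
    print(updates_to_json(SAMPLE))
```

Because a record is only closed when the next title appears, a page with many updates yields one JSON object per update without any wrapping element in the HTML.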

Related

Parsing user page information from Wikipedia. How to remove redundant information?

I'm trying to fetch public user information from Wikipedia using the API (with the script get_pages_revisions.py). After I got the revisions, I used BeautifulSoup to strip all the HTML tags. However, the remaining text is still quite messy.
For example, when I fetched the text of the page User:(aeropagitica), the result looked like this:
(A small part of it)
{{administrator}}
{{divbox|gray||Wikipedia is currently working on {{NUMBEROFARTICLES}} articles. The local time at the Wikipedia servers is '''{{CURRENTTIME}}''' on {{CURRENTDAYNAME}} {{CURRENTDAY}} {{CURRENTMONTHNAME}}, {{CURRENTYEAR}}.}}
• '''[[:WP:AIV|AIV]]''' •
'''[[Wikipedia:Articles for deletion/Log/{{CURRENTYEAR}} {{CURRENTMONTHNAME}} {{CURRENTDAY}}|AfD]]''' • '''[[User:(aeropagitica)/RFA summary|RfA]]''' • '''[[:Category:Candidates for speedy deletion|CSD]]''' • '''[[Wikipedia:Template messages|tpl]]''' • '''[[Wikipedia:Template_messages/User_talk_namespace|user talk tpl]]''' • '''[[Special:Newpages|new]]''' • '''[[Wikipedia:Stubs|stubs]]''' • '''[[Wikipedia:Copyright problems|(c)]]''' • '''[[Wikipedia:Manual of Style|MoS]]''' • '''[[User:Interiot/Tool2|edits (interiot)]]''' • '''[[Wikipedia:Proposed_deletion|prod]]''' • '''[[Special:Log/Newusers|newusers]]''' • '''[http://tools.wikimedia.de/~essjay/edit_count/Count.php? PHP interiot's tool]''' • '''[http://tools.wikimedia.de/~interiot/cgi-bin/Tool1/wannabe_kate Interiot's tool 1]''' • '''[[:Wikipedia:Article Creation and Improvement Drive|Article Improvement]]'''
{{purge|Purge server cache}}
I was [[Wikipedia:Requests_for_adminship/%28aeropagitica%29|nominated for adminship]] by [[User:King of Hearts|King of Hearts]] on February 27th 2006. The vote achieved consensus and I was accepted for the role with a score of '''40/10/5''' on March 7th 2006.
When I am not working on Wikipedia pages, I enjoy learning to play acoustic fingerstyle guitar, photography, learning languages (Spanish and French) and travel.
''Userboxes''
{| style="text-align:center; border: 1px solid #000000; background-color:#00cc99; width:100%; -moz-border-radius: 15px;"
|- padding:5em;padding-top:0.5em;"
|{{user en}}
May I ask:
How can I remove strings such as style="...." or cellpadding="...."? Can I remove all formatting strings like these at once?
There are many blocks like this:
{{Userbox|#77E0E8|#D0F8FF|{{CURRENTDAY}}|It is currently a [[{{CURRENTDAYNAME}}]]. I don't like {{CURRENTDAYNAME}}s.}}
The information after "It is .." is what we need, but the text before it (Userbox|#77E0E8) only defines the web layout and should be removed. Is there any way to remove the first half of this line?
(Userbox is just one kind of block; there are many other types such as User: and Category:, so it would be quite hard to remove them all with custom re rules.)
(I'm a beginner with BeautifulSoup and web parsing, so any suggestions or hints will be valuable. Thank you in advance for your help!)
You're using the Revisions API which only allows you to get the page content as Wikitext. That's the "messy" text you're seeing.
You can instead use the Parse API to get the rendered HTML content of the page, which you can then put into a local DOM parser of your choosing or just strip HTML tags if that works for you.
See the MediaWiki API documentation for details, including examples on how to request the parsed contents of a page.
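A minimal sketch of such a request in Python, using only the standard library. The parameters action=parse, page, prop=text and format=json belong to the Parse API; the function names build_parse_url and fetch_rendered_html and the User-Agent string are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page_title):
    """Return a Parse API URL asking for the rendered HTML of one page."""
    params = {
        "action": "parse",   # render the page instead of returning wikitext
        "page": page_title,
        "prop": "text",      # only the rendered HTML is needed
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def fetch_rendered_html(page_title):
    """Download the rendered HTML of a page (performs a network request)."""
    # Hypothetical User-Agent; replace it with one that identifies your script.
    request = Request(
        build_parse_url(page_title),
        headers={"User-Agent": "parse-api-example/0.1"},
    )
    with urlopen(request) as response:
        payload = json.load(response)
    text = payload["parse"]["text"]
    # The default JSON format wraps the HTML in {"*": ...}; newer format
    # versions return a plain string, so both shapes are handled here.
    return text["*"] if isinstance(text, dict) else text


if __name__ == "__main__":
    print(build_parse_url("User:(aeropagitica)"))
```

The HTML returned by fetch_rendered_html has the templates already expanded, so it can be passed to BeautifulSoup and reduced to text with get_text() instead of cleaning the wikitext by hand.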

Scraping for a specific title using Nokogiri in Ruby

I'm currently practicing web scraping using the NYT Best Sellers website. I want to get the title of the #1 book on the list and found the HTML element:
<div class="book-body">
<p class="freshness">12 weeks on the list</p>
<h3 class="title" itemprop="name">CRAZY RICH ASIANS</h3>
<p class="author" itemprop="author">by Kevin Kwan</p>
<p itemprop="description" class="description">A New Yorker gets a surprise when she spends the summer with her boyfriend in Singapore.</p>
</div>
I'm using the following code to grab the specific text:
doc.css(".title").text
However, it returns the titles of every book on the list. How would I go about getting just the specific book title, "CRAZY RICH ASIANS"?
If you look at the return value of doc.css(".title"), you will see it is a collection of all the titles, as Nokogiri::XML::Element objects.
To my knowledge, CSS does not have a selector for targeting the first element of a given class (someone may certainly correct me if I am wrong). Getting just the first element from a Nokogiri::XML::NodeSet is still very simple, though, as it acts like an Array in many cases. For example:
doc.css(".title")[0].text
You could also use XPath to select just the first one (since XPath does support index-based selection), like so:
doc.xpath("(//h3[@class='title'])[1]").text
Please Note:
Ruby indexes start at 0 as in the first example;
XPath indexes start at 1 as in the second example.

Accessibility and asterisks end notes

I have a lengthy document that I need to convert to WCAG AA compliant HTML and the author used asterisks as end notes like *, **, and ***. Here is an example:
Solano County Atlas Fire (Solano County) 10/17*
Ponderosa Fire (Butte County) 08/17*
Helena Fire (Trinity County) 08/17*
...
* For taxable years beginning on or after January 1, 2014...
NOTE: multiple * asterisks in the above list all point to a single end note reference making it difficult to use traditional footnote practices.
Below is my attempt to solve this using aria-describedby; however, JAWS does not read the asterisks or the description at all.
Solano County Atlas Fire (Solano County) 10/17<span aria-describedby="dd1">*</span>
...
<p id="dd1">* For taxable years beginning on or after January 1, 2014...</p>
After some research it appears that JAWS and some other screen readers do not announce some punctuation at all. So I am not sure how exactly to go about fixing this. I am not really at liberty to change the characters unless there is no other possible solution.
What would you suggest I do to fix this?
I'd suggest a simple approach, one used in quite a number of the HTML books I have read.
Make your asterisk a link with an Id, i.e., <a id="linkToNote1" href="#note1">*</a>.
Add a heading called "Notes" to the end of the document.
Add notes either as list items or as paragraphs with corresponding Ids (in this case, note1).
Most importantly, end each note with a link leading back to the appropriate place in the text, e.g., <a href="#linkToNote1">Go back</a>.
If, however, you want to use ARIA, do not put aria-describedby on the asterisk alone: you are right that many users set their punctuation level quite low so that they are not distracted by those symbols in a lengthy text.

What HTML tag is semantically suitable for an "alternating" list with "intervals"?

Let's assume I want to present a TODO list such as this:
8:00
Walk the dog!
8:30
Clean the car!
9:20
Rob the bank!
9:21
Run the dog!
22:00
Have a beer!
22:05
My first thought was to use a definition list, but it's not a great fit:
<dl>
<dt>8:00</dt>
<dd>Walk the dog!</dd>
<dt>8:30</dt>
<dd>Clean the car!</dd>
<dt>9:20</dt>
<dd>Rob the bank!</dd>
<dt>9:21</dt>
<dd>Run the dog!</dd>
<dt>22:00</dt>
<dd>Have a beer!</dd>
<dt>22:05</dt>
<!--<dd>Missing element???</dd>-->
</dl>
AFAIK, a <dt> must be followed by one or more <dd> tags. But the last time (22:05) does not have any (non-empty) element to follow it.
Furthermore, each point in time semantically belongs just as much to the task before it as to the one after it, but that relationship is lost here too.
Is there any other combination of HTML tag(s) that might fit this data better?
HTML can’t convey this. You have to make the durations/intervals explicit (i.e., with natural language).
While the time element can be used for durations, a duration has no specific start or end date. So you can convey that walking the dog takes 30 minutes, but not that you walk the dog from 8:00 to 8:30. By nesting time elements, you can (at most) hint at that.
The equivalent for non-time-related values is the data element, but there are no standardized formats, so you can’t convey that a value represents sea depth or a sea depth interval.
So with dl it could look like this:
<dl>
<dt><time datetime="12h 39m"><time>09:21</time> to <time>22:00</time></time></dt>
<dd>Run the dog!</dd>
<dt><time datetime="5m"><time>22:00</time> to <time>22:05</time></time></dt>
<dd>Have a beer!</dd>
</dl>
(This would also allow you to skip times, e.g., in case you didn’t really run the dog from 09:21 to 22:00.)
A semantically weaker variant (e.g., if you don’t want to provide the end time in the same element) would be something like this:
<dt><time datetime="5m"><time>22:00</time></time></dt>
In that case, you would have to provide an empty dd element for the last time (or whatever makes sense in your case).

How To (Semantically) Mark Up A (Theatre) Script / Play in HTML5

How To (Semantically) Mark Up A (Theatre) Script / Play in HTML5?
For obvious reasons, it's hard to search for "play" and "script" without a search engine thinking you mean "play a sound" and "JavaScript".
How can I mark up a script (as in the document one would give to actors in a play) such that it is semantically correct, and easy to style?
For example, let's take the start of Hamlet
Hamlet
ACT I
SCENE I Elsinore. A platform before the castle.
[FRANCISCO at his post. Enter to him BERNARDO]
BERNARDO Who's there?
FRANCISCO Nay, answer me: stand, and unfold yourself.
Fairly obviously, I think, one should start with
<h1 id="title">Hamlet</h1>
<h2 id="act-1">Act 1</h2>
<h3 id="scene-1">Scene 1</h3>
But, then I get stuck.
I've tried looking at MicroData, but Schema.org's CreativeWork[0] really doesn't contain much that would be useful in the case of a work of fiction.
Is it enough just to say
<p class="stage-direction">FRANCISCO at his post. Enter to him BERNARDO</p>
<p id="1"><span class="character bernardo">BERNARDO</span>Who's there?</p>
<p id="2"><span class="character francisco">FRANCISCO</span>Nay, answer me: stand, and unfold yourself.</p>
Or is there a better / more sensible way of doing things?
[0]http://schema.org/CreativeWork
It seems that the idea of precisely specifying markup for dialogue has been abandoned, and the W3C now simply offers some guidelines which pretty much equate to your idea of using paragraphs and spans.
Note that the dl element, which older sources - including the spec - had formerly recommended, should now definitely not be used: "The dl element is inappropriate for marking up dialogue".
But of course all this might change next week, or month, or year…
Does this provide any inspiration? caesar in xml