JSoup: Accessing Data Within Multiple HTML classes

I recently started using JSoup to do HTML data scraping and I couldn't find enough detailed information on jsoup.org on how to find div classes that are nested within other div classes.
<div class="Food">
<a href="/eating/101" class="Eating">
<div class="Groceries">
<div class="Vegtables">
<div class="LeafyGreens">
<img src="https://RealisticBroccoli.svg" alt="" class="Broccoli-logo"></div>
<div class="Broccoli Fact">Fun Fact About Broccoli:</div>
</div></div></a></div>
<div class="Food">
<a href="/eating/102" class="Eating">
<div class="Groceries">
<div class="Vegtables">
<div class="LeafyGreens">
<img src="https://CartoonBroccoli.svg" alt="" class="Broccoli-logo"></div>
<div class="Broccoli Fact">Fun Fact About Broccoli:</div>
</div></div></a></div>
I created a simplified version of an HTML project I am working on. I know there seems to be an excessive number of div tags, but that is what makes this problem challenging for me. I want to scrape the text of the Broccoli Fact that sits under the a[href] pointing to /eating/101, without scraping the fact from /eating/102.
In my experience I cannot scrape the "Broccoli Fact" class with a single instruction; it doesn't produce any output. I think it has something to do with the a href "/eating/101". Thanks for the help!

From what I understand, you may have one or both of these two problems:
1) In HTML an element can carry more than one class name, and that is what happens with Broccoli Fact: it is actually two classes, Broccoli and Fact. Your Jsoup CSS selector should match both. A class selector is the class name preceded by a dot (.Broccoli and .Fact), and writing the two back to back with no space requires both classes on the same element:
div.Broccoli.Fact
2) The HTML in your example contains more than one Broccoli Fact, but you only want the first one. There are several ways of dealing with this, and which is best is hard to say without knowing more about your task. Here are some suggestions:
a) Gather all the Broccoli facts but only use the first one. Jsoup returns an Elements object, which implements the List interface, so you can easily access the first element (for example with get(0) or first()).
b) Use a more precise CSS selector. Something like this could work:
div.Food>a[href$=101] div.Broccoli.Fact
Have a look here to learn about JSoup CSS selectors: https://jsoup.org/cookbook/extracting-data/selector-syntax


What is the correct HTML for displaying a statistic in the most accessible way?

Context
I am creating a component that displays important statistics that looks like this:
It will be used in a few contexts:
As part of a dashboard where there will be many of these components for different stats such as Twitter followers and Github stars.
It's also going to appear on its own within a blog post (which is about how this component is built).
Question
What would be the most appropriate HTML to make this component accessible? Do I need to use ARIA attributes at all?
My previous approach
I was leaning towards using a figure element, where the title, "Github followers", is the caption.
<figure>
<figcaption>Github followers</figcaption>
<span>10</span>
</figure>
My current approach
I've changed to using divs since I won't know all the contexts where this component is going to be used. Instead I've used the aria-labelledby attribute to associate the number with its label.
<div>
<div id="followers">Github followers</div>
<div aria-labelledby="followers">10</div>
</div>
Thanks to everyone who commented on this post. I'm going to try to answer this as best I can with the information I've gathered.
<div>
<div id="followers">Github followers</div>
<div aria-labelledby="followers">10</div>
</div>
This is part of a reusable React component, so I'm using divs instead of contextual/semantic HTML elements. To give assistive technologies useful information, I am using the aria-labelledby attribute.
I've written about this in more detail on my blog if anyone is interested in the full solution I built: https://www.jamiedavenport.dev/blog/building-a-stat-card-component

selecting deep elements in d3

For my question, you can refer to the udacity.com main page.
I'm trying to access the text "The Udacity Difference", somewhere in the middle of the page.
I tried this :
d3.select("div.banner-content.h2.h-slim")
d3.select("div.banner-content.h-slim")
Neither of the above works, so I looked at the page in the Elements section of dev tools.
Hovering there, I could see that
div.banner-content contains further:
{div.container
{div.row
{div.col-xs-12 text-center
{h2.h-slim}}}}
Then I thought, OK, I should at least try to get the "container", but even this
d3.select("div.banner-content.div.container")
OR
d3.select("div.banner-content.container")
doesn't work!
Where is the fault in my logic?
You can find nested elements using d3's chained syntax as shown below.
HTML:
<div class="banner-content">
<div class="container">
<h2 class="h-slim">Header</h2>
</div>
</div>
Code:
d3.select("div.banner-content").select("div.container").select("h2.h-slim")
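EDIT: No need to type such long syntax. Just separate the selectors with a space, which is the CSS descendant combinator. Joining them with dots, as in div.banner-content.container, instead asks for a single div that carries both classes, which is why the original attempts matched nothing.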
d3.select("div.banner-content div.container h2.h-slim")
The code below gives the same result in this case. Add tag names where you need to be more specific.
d3.select("div.banner-content .container .h-slim")

Repeated content (sub-template) in AngularJS

I have a template which contains (in part) exactly the same content repeated two or three times with minor changes to the bindings, eg:
<div class="xyz-state0" data-ng-hide="data.error || !data.states[0].name">
<div class="xyz-content">
<img data-ng-src="{{data.states[0].image}}" width="48" height="48">
<span>{{data.states[0].name}}</span>
</div>
</div>
<div class="xyz-state1" data-ng-hide="data.error || !data.states[1].name">
<div class="xyz-content">
<img data-ng-src="{{data.states[1].image}}" width="48" height="48">
<span>{{data.states[1].name}}</span>
</div>
</div>
How do I write this to avoid duplicating this HTML? This is specific to its parent view (it won't be used anywhere else) so creating a full-blown widget seems wrong.
Basically I want something similar to ngRepeat, but I can't use that for the following reasons:
I need a specific (and different) style on each parent div.
I need to render a specific number of divs (2 in this case, 3 in another) regardless of whether or not they exist in the scope (ie. data.states could only have 1 element in it, but it still needs to create both divs).
In the other case the items need to be rendered out of order (first 1, then 0, then 2).
I've managed to get a template fragment in a separate HTML file and included it with ngInclude, but I don't know how to get a single name in its new scope to refer to a specific item. My first attempt was this, which doesn't work:
<div class="xyz-state0" data-ng-include="'state.tpl.html'" data-ng-init="state=data.state[0]"></div>
<div class="xyz-state1" data-ng-include="'state.tpl.html'" data-ng-init="state=data.state[1]"></div>
I suspect I could probably do it with a custom controller, but that seems like a heavy solution too. What's the Right Way™?
This is pretty much a textbook case for a custom directive. Define a directive, and then you can do
<state ng-repeat="item in data.states" item="item"></state>
Alternatively, if a custom directive is overkill (which mainly depends on whether you'll be reusing that view component elsewhere), you could just put an ng-repeat on the entire div. The only real issue is the class="xyz-stateN" stuff, but I bet you could hoke that up with ng-class usage.
EDIT:
If you do an ng-repeat, you can just use the $index key (as long as you're always counting up from zero and the state class is the same as the index). Something like
<div ng-class="'xyz-state' + $index" ng-repeat="state in data.states" data-ng-hide="data.error || !state.name">
<div class="xyz-content">
<img data-ng-src="{{state.image}}" width="48" height="48">
<span>{{state.name}}</span>
</div>
</div>
Would probably work fine. All that said, it's almost always worth making a directive in my opinion. Code gets recycled all the time, plus you can be cautious with namespacing and modularizing if that makes you nervous.
Well, this seems to do the trick (thanks to pfooti for the hint). I'm still not entirely happy with it as the directive is registered globally, whereas I really only want it in this one place.
state.tpl.html:
<div class="xyz-content" data-ng-show="state.name">
<img data-ng-src="{{state.image}}" width="48" height="48" />
<span>{{state.name}}</span>
</div>
view.tpl.html:
<div data-xyz-state="data.states[0]" class="xyz-state0"
data-ng-hide="data.error"></div>
<div data-xyz-state="data.states[1]" class="xyz-state1"
data-ng-hide="data.error"></div>
app.js:
app.directive('xyzState', [function() {
return {
templateUrl: 'state.tpl.html',
scope: {
state: '=xyzState',
},
};
}]);
Interestingly it doesn't work if I try to declare the introducing element as <xyz-state ...> instead of <div data-xyz-state="" ...>, despite the docs saying that this ought to work too. I assume there's some sort of validation thing interfering here.
Just as an FYI, I later revisited this code and decided to do it like this instead: (I'm letting my original answer stand as that is more like what I was originally asking for, and they both seem reasonable in different cases.)
view.tpl.html
<div data-ng-repeat="state in data.states" data-ng-if="!data.error"
data-ng-class="state.class">
<div class="xyz-content" data-ng-show="state.name">
<img data-ng-src="{{state.image}}" width="48" height="48" />
<span>{{state.name}}</span>
</div>
</div>
app.js
...
while ($scope.data.states.length < 2)
$scope.data.states.push({});
$scope.data.states[0].class = 'xyz-state1';
$scope.data.states[1].class = 'xyz-state2';
...
I've done something similar for the other (3-item) case, except there as I wanted to rearrange the order of the items I added an order property for the desired order in the controller and then used data-ng-repeat="button in data.buttons|orderBy:'order'" in the view.
This does mean that a bit of view definitions (display order and CSS classes) have leaked into the controller, but I think the benefit to code clarity outweighs that.

Regex: Recursively remove surrounding <div> tags in HTML

Before someone whines about not using regex to parse HTML, I refer you to a previous question of mine with an elegant solution which was quickly labeled as "answered" by another question's answer whining about not using regex to parse HTML (which was eventually removed from my question): Regex: Find groups of lowercase letters between HTML tag
I am again working with epubs (in Sigil), this time cleaning up the XHTML output from InDesign CC. Different from previous ID versions, it now surrounds many objects with extra <div> tags for some kind of positioning/layout reason. I am writing my own clean CSS, so the epub is exported without generating CSS, leaving extraneous <div> tags surrounding other <div>s, sometimes containing a nested structure of unnecessary <div>s.
An example of what I'm dealing with:
<div><!--unnecessary-->
<div class="figure-box">
<h4 class="f-n"><b class="b">Figure 1.3: Foobar</b></h4>
<div><!--unnecessary-->
<div class="figure">
<img alt="foo" src="../Images/bar.jpg"/>
</div>
</div>
<p class="f-ct">This is a caption, yadda yadda.</p>
<p class="f-src">Source: Copyright blah blah.</p>
</div>
</div>
Note: <!--unnecessary--> comments are illustrative, and are not present in the actual code.
I have written this regular expression in an attempt to remove the unstyled surrounding <div> tags with some success, but I'm hoping for a more elegant solution:
^(\s*)<div>\n\s*(<div class=".+?">.+?</div>)\n\1</div>
The above string matches the outermost <div>, where I can then do a replace with \1\2 to keep the contents and also the first indent (though the indent isn't absolutely necessary).
The issue with this is that I must do a find/replace all several times in order to get at and remove all of the unnecessary <div>s that are nested.
Is this as good as it will get, or is there a solution like the one I linked to above for this purpose?
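The repeated find/replace described above can be automated by re-running the same replacement until the text stops changing. The following is only a sketch of that idea in JavaScript, outside Sigil: the function name stripWrapperDivs and the sample string are made up for illustration, and it assumes a regex engine with the s (dotAll) flag and consistently indented input, since the \1 backreference relies on the indent to pair each opening <div> with its own closing tag.

```javascript
// Sketch: re-apply the question's find/replace until nothing changes.
// Flags: g = replace all matches, m = ^ matches at line starts, s = . matches newlines.
const WRAPPER = /^(\s*)<div>\n\s*(<div class=".+?">.+?<\/div>)\n\1<\/div>/gms;

// Hypothetical helper name; loops until a pass makes no further change.
function stripWrapperDivs(xhtml) {
  let previous;
  do {
    previous = xhtml;
    xhtml = xhtml.replace(WRAPPER, "$1$2");
  } while (xhtml !== previous);
  return xhtml;
}

// Hypothetical sample, indented so that \1 can match each wrapper's indent.
const sample = [
  '<div>',
  '  <div class="figure-box">',
  '    <h4 class="f-n">Figure 1.3: Foobar</h4>',
  '    <div>',
  '      <div class="figure">',
  '        <img alt="foo" src="../Images/bar.jpg"/>',
  '      </div>',
  '    </div>',
  '    <p class="f-ct">This is a caption.</p>',
  '  </div>',
  '</div>',
].join("\n");

console.log(stripWrapperDivs(sample));
```

A single pass only removes the outermost wrapper here, because the inner wrapper lies inside the text the first match consumed; the loop picks it up on the next pass. This does not make the expression itself any more elegant, it only removes the manual repetition.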

Tags That Will Operate As Multiple Tags

I had a very hard time wording what I want to do, and probably for that reason I could not find any post or website through Google with an answer, so I will explain in full detail.
<br />
<hr />
<br />
Break, horizontal rule, break is my way of separating one part of a post from another. How can I group the three into one simple tag that replaces them, saving me time and hassle?
It would also be helpful to know whether tag groupings can be defined with more than just empty tags. For example, a tag identified by the string title1 would stand for all the formatting, text, and sub-elements of a template coded somewhere else.
If this question has already been posted then I am sorry. Thanks!
You don't need the <br> tags because <hr> is a block-level element and automatically creates a line break. If you're doing that to get some vertical space above and below the <hr>, why not just use CSS to give the <hr> some margin?
hr
{
margin-bottom: 20px;
margin-top: 20px;
}
Neither <br> nor the proposed alternative <hr> is particularly well-suited here.
You need to learn about CSS. All you need to do is apply appropriate styles (i.e. a margin) to the elements that wrap your posts.
<div class="post">
<h1>Post #1</h1>
<p>something</p>
</div>
<div class="post">
<h1>Post #2</h1>
<p>something else</p>
</div>
div.post {
margin-bottom: 3em;
}
If you are using HTML5 then use <article> instead of <div class="post"> to denote individual posts.
As for grouping tags, this is currently not possible in plain HTML; you need to apply some preprocessing for that. The usual solution is to use a content management system, which creates the final HTML from your content and an HTML template.
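As a rough illustration of what such preprocessing means, here is a minimal JavaScript sketch that expands a placeholder into the three-tag separator before the HTML is served. The <post-break /> placeholder and the expandPostBreaks function are invented for this example; they are not part of HTML or of any particular content management system, which would normally handle this through its own template syntax.

```javascript
// Hypothetical placeholder: "<post-break />" is not a real HTML element.
// It only exists in the source text and is expanded before publishing.
const SEPARATOR = "<br />\n<hr />\n<br />";

// Replace every placeholder with the full break / rule / break group.
function expandPostBreaks(source) {
  return source.replace(/<post-break\s*\/>/g, SEPARATOR);
}

const page = "<p>First post</p>\n<post-break />\n<p>Second post</p>";
console.log(expandPostBreaks(page));
```

The same substitution idea extends to larger fragments, which is essentially what a template engine does with named includes.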
Whilst this specific problem can be solved with a little bit of CSS, it sounds like you need a layout or templating engine of some sort in the long run. I'm a Rubyist by trade, so my go-to solution for doing this is Jekyll.
Jekyll generates static HTML files from layouts and content that you write. You can abstract a lot of the repetitive layout markup into separate files and then just reference them when you need them.
The following guide is a good place to get started: http://net.tutsplus.com/tutorials/other/building-static-sites-with-jekyll/
If you're already working with another framework then do some reading around it first to see if there's something there you can use. If you're just writing straight-up HTML/CSS though, then definitely give Jekyll a try.