I want to learn web scraping, so I started practicing. I am trying to get the data-ad-id attribute from HTML using XPath.
The HTML structure looks like this:
<body id="z1234">
<div class="viewport">
<div class="g-row">
<div class="g-col-9">
<div class="cBox cBox--content cBox--resultList">
<div class="cBox-body cBox-body--resultitem dealerAd rbt-reg rbt-no-top"><a class="link--muted no--text--decoration result-item" href="url" data-ad-id="248059713"></a>
</div>
</div>
</div>
</div>
</body>
The XPath for <a class="link--muted no--text--decoration result-item"> is //*[@id="z1234"]/div[3]/div[4]/div[2]/div[1]/div[11]/a. If I choose a different car, only the index of the last div changes.
Based on this, I wrote the following C# code:
var url = "https://suchen.mobile.de/fahrzeuge/search.html?damageUnrepaired=NO_DAMAGE_UNREPAIRED&isSearchRequest=true&maxPowerAsArray=KW&maxPrice=10000&minPowerAsArray=KW&minPrice=10000&scopeId=C";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
string sourceCode = sr.ReadToEnd();
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(sourceCode);
var rows = document.DocumentNode.SelectNodes("//*[@id='z1234']/div[3]/div[4]/div[2]/div[1]/div[11]");
foreach (var row in rows)
{
var id = row.SelectSingleNode("a[@data-ad-id]").InnerText;
Console.WriteLine("id:" + id);
}
}
I cannot get anything from this node; it is null. How can I get data-ad-id?
EDIT
I changed my C# code:
var rows = document.DocumentNode.SelectNodes("//a[@data-ad-id]")[0];
var id = rows.Attributes["data-ad-id"].Value;
Now I can get data-ad-id.
Looking at the site's markup, that <a> tag has no inner text; it contains only DIV and IMG tags, so InnerText gives you nothing useful.
You need to fetch the data-ad-id attribute itself, using
//a[@data-ad-id]/@data-ad-id
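The same idea (match the element by the presence of the attribute, then read the attribute value rather than the text) can be sketched in Python with only the standard library. This is an illustration, not the HtmlAgilityPack API: the markup is a cut-down, well-formed version of the snippet from the question, and the names SNIPPET and ad_ids are made up for the example.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed version of the markup from the question (assumption).
SNIPPET = """
<body id="z1234">
  <div class="cBox-body">
    <a class="result-item" href="url" data-ad-id="248059713"></a>
  </div>
</body>
"""

root = ET.fromstring(SNIPPET)

# ElementTree's XPath subset can filter on attribute presence, but it cannot
# return attribute nodes, so the value is read from each matched element.
ad_ids = [a.get("data-ad-id") for a in root.findall(".//a[@data-ad-id]")]
print(ad_ids)  # ['248059713']
```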
Related
How do I get the value of an attribute from HTML code?
This is the attribute I want to get: data-rbd-draggable-id="10061894-9a62-4ce5-adfa-ea3b879e15eb"
An HTML example:
<div data-rbd-draggable-context-id="0" data-rbd-draggable-id="10061894-9a62-4ce5-adfa-ea3b879e15eb" style="position: static;">
<div class="collapse-group category-item-list-item__collapse-group-link">
for item in all_items:
    element = {}
    idproduct = await item.query_selector('data-rbd-draggable-id')
    element['id'] = await idproduct.inner_text()
    name = await item.query_selector('.category-list-item-description__name')
    element['nome'] = await name.inner_text()
    elements.append(element)
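One likely issue in the loop above: query_selector expects a CSS selector, so a bare attribute name matches nothing, and inner_text() returns text, not an attribute. Assuming this is Playwright's Python API, the attribute selector would be '[data-rbd-draggable-id]' and the value would be read with get_attribute('data-rbd-draggable-id'), possibly on item itself if it is the element carrying the attribute. The attribute-versus-text distinction can be shown with the standard library alone; the class name DraggableIdCollector below is hypothetical and the closing tags in SNIPPET are added for the example.

```python
from html.parser import HTMLParser


class DraggableIdCollector(HTMLParser):
    """Collects the data-rbd-draggable-id value of every tag that has one."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        attr_map = dict(attrs)
        if "data-rbd-draggable-id" in attr_map:
            self.ids.append(attr_map["data-rbd-draggable-id"])


# Markup from the question, with closing tags added (assumption).
SNIPPET = """
<div data-rbd-draggable-context-id="0" data-rbd-draggable-id="10061894-9a62-4ce5-adfa-ea3b879e15eb" style="position: static;">
  <div class="collapse-group category-item-list-item__collapse-group-link"></div>
</div>
"""

collector = DraggableIdCollector()
collector.feed(SNIPPET)
print(collector.ids)  # ['10061894-9a62-4ce5-adfa-ea3b879e15eb']
```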
<a className="stats__back" href="./..">
🠔
</a>
<div className="profile-heading">ACCOUNT</div>
<div className="profile-header">
<div className="profile-avatar">
<img className="account-avatar" src={`./images/avatar${avatar}.png`} alt="User's avatar" width="150" height="150" />
</div>
<div className="profile-header-info">
<div className="profile-username">{userName}</div>
<div className="profile-creation-date">
This is the part of the code I'm working on, and I'm trying to access the div with className="profile-username" in a unit test.
Here is how my test looks:
test('New user name is set after user name edition.', () => {
act(() => {
ReactDOM.render(<Account/>, container);
});
let profileUserName = container.querySelector("div.profile-username");
let editButton = container.querySelector('.edition-text');
let editUserName = container.querySelector('.account_modal-nick-input');
let okButton = container.querySelector('.modal-button-save');
let newUserName = "newUserName";
fireEvent.click(editButton);
fireEvent.change(editUserName, {target: {value: newUserName}});
fireEvent.click(okButton);
expect(profileUserName.value).toBe(newUserName);
});
I'm totally new to React and to unit tests in general, so my question is: how do I get this particular div using querySelector, and how do I read its value afterwards? Is it just divContainerVariableName.value, or something else?
Thank you in advance
const profileUserName = container.querySelector("div.profile-username");
expect(profileUserName.textContent).toBe(newUserName);
A div has no value property, so read its textContent instead. If you have more than one .profile-username element, querySelector returns the first one that appears in the DOM tree.
First check that you have targeted the correct DOM element, then read its text content.
I am trying to build a scraper and I would need some help with the following:
I would like to grab a bunch of data from an a-tag and some divs/spans nested in the same div.
My code looks like this:
page = Nokogiri::HTML(open(website))
page.search('.company').each { |e| companies << e.text.strip }
page.search('.jobtitle').each { |e| jobtitles << e.text.strip }
page.search('.location').each { |e| locations << e.text.strip }
page.xpath('//a[@class="turnstileLink"]').map{ |e| links << e['href'] }
For the first three (company, title and location) I get either 16 or 15 results, but for the last search my array contains only 10 elements. Oddly, they also don't match the first 10 elements of any of the other arrays; they start matching somewhere around the 3rd or 4th element instead.
The html of a typical card that I would like to target is here:
<div class="row result clickcard" id="pj_81c3e09223cbc6b3" data-jk="81c3e09223cbc6b3" data-advn="4563763653116462" data-tu="">
<a target="_blank" id="sja1" data-tn-element="jobTitle" class="jobtitle turnstileLink" href="/pagead/clk?mo=r&ad=-6NYlbfkN0DhDTzlYIMy8YIuVE6IrMC_kH05KGZgoAT6LTrcTn8STrwXoiuruouegXiAvJy4qud6xIecRibm3b0Q5eOBkpCiV3R04sAyQbvP7gt6NKZVpCRp32eFzXudmk-TIABX3xEZGo90a47Vz9OofqZaLDh37545RNQ3sFjM6VzWNEWwKf_YoXxeGKcAICj9AADyBuYAY7p9UIUxoox7J5U9gO8Zo2dvRW-i5FJtaUr49Vjsl04W0Jp-CN2azbfp6rrfT6RYFbJ_YAc2iI-L37eeygDtI4KXQwv_elrV8ZLEKo9rkcfEzbE129kX7JKeEq5wJ1dj7GJ4ONH1lIPJQd1gJLoqNYJVQlLTKJiBP72Z0RBmgfZQ-69U8AoEyMT6pytz6iqykLCnO-SxClmvFPJsNV96oBGzpMWtWQeVgGQ49jZfBBRq9Ubw7N73iEjCv6oQ70hcW1P4d8DYK0pCI7vu2KfUh0P9vx8AKC6wY2QoAZeeP4OiBIJ8ikKSIUYJTbe3UwKcLYP7r_3_rx1gY_JO1ReG21ctCxfqGH9DnqTSjz3SYCMZ2ZekooXa&vjs=3&p=1&sk=&fvj=1" title="Private Care Jobs With Elder - Immediate Start - £550 to £750 pw" rel="noopener nofollow" onmousedown="sjomd('sja1'); clk('sja1');" onclick="setRefineByCookie([]); sjoc('sja1',0); convCtr('SJ')">Private Care Jobs With Elder - Immediate Start - £550 to £75...</a>
<br>
<div class="sjcl">
<span class="company">
Elder</span>
<span class="location">London</span>
</div>
<div class="">
<table cellpadding="0" cellspacing="0" border="0"><tbody><tr><td class="snip">
<span class="summary">
Pass a full DBS check or have a valid check already. Access to the internet and a smartphone. At Elder, we’re looking for caring individuals to join our...</span>
</td></tr></tbody></table>
</div>
<div class="sjCapt">
<div class="result-link-bar-container">
<div class="result-link-bar"><span class=" sponsoredGray ">Sponsored</span> - <span id="tt_set_10" class="tt_set"><a id="sj_81c3e09223cbc6b3" href="#" class="sl resultLink save-job-link " onclick="changeJobState('81c3e09223cbc6b3', 'save', 'linkbar', true, ''); return false;" title="Save this job to my.indeed">save job</a></span><div id="editsaved2_81c3e09223cbc6b3" class="edit_note_content" style="display:none;"></div><script>if (!window['sj_result_81c3e09223cbc6b3']) {window['sj_result_81c3e09223cbc6b3'] = {};}window['sj_result_81c3e09223cbc6b3']['showSource'] = false; window['sj_result_81c3e09223cbc6b3']['source'] = "Indeed"; window['sj_result_81c3e09223cbc6b3']['loggedIn'] = false; window['sj_result_81c3e09223cbc6b3']['showMyJobsLinks'] = false;window['sj_result_81c3e09223cbc6b3']['undoAction'] = "unsave";window['sj_result_81c3e09223cbc6b3']['jobKey'] = "81c3e09223cbc6b3"; window['sj_result_81c3e09223cbc6b3']['myIndeedAvailable'] = true; window['sj_result_81c3e09223cbc6b3']['showMoreActionsLink'] = window['sj_result_81c3e09223cbc6b3']['showMoreActionsLink'] || false; window['sj_result_81c3e09223cbc6b3']['resultNumber'] = 10; window['sj_result_81c3e09223cbc6b3']['jobStateChangedToSaved'] = false; window['sj_result_81c3e09223cbc6b3']['searchState'] = "l=London&start=20"; window['sj_result_81c3e09223cbc6b3']['basicPermaLink'] = "https://www.indeed.co.uk"; window['sj_result_81c3e09223cbc6b3']['saveJobFailed'] = false; window['sj_result_81c3e09223cbc6b3']['removeJobFailed'] = false; window['sj_result_81c3e09223cbc6b3']['requestPending'] = false; window['sj_result_81c3e09223cbc6b3']['notesEnabled'] = false; window['sj_result_81c3e09223cbc6b3']['currentPage'] = "serp"; window['sj_result_81c3e09223cbc6b3']['sponsored'] = true;window['sj_result_81c3e09223cbc6b3']['showSponsor'] = true;window['sj_result_81c3e09223cbc6b3']['reportJobButtonEnabled'] = false; window['sj_result_81c3e09223cbc6b3']['showMyJobsHired'] = false; 
window['sj_result_81c3e09223cbc6b3']['showSaveForSponsored'] = true; window['sj_result_81c3e09223cbc6b3']['showJobAge'] = true;</script></div></div>
<div class="tab-container">
<div class="sign-in-container result-tab"></div>
<div class="tellafriend-container result-tab email_job_content"></div>
</div>
</div>
</div>
All cards have the same class ".clickcard" and all the relevant links have the class ".turnstileLink", but I can't get consistent results when I page.search or page.xpath them: the arrays come back with different numbers of elements, and I can't match up the data across them correctly.
So my question is: if I want to scrape the company name, location, job title, the URL of that page and possibly another value, how would I best go about this?
I would appreciate any feedback!
Edit:
The contains() expression needs to be more complex:
contains(
concat(' ',normalize-space(@class),' '),
' turnstileLink '
)
to prevent classes like turnstileLinkerCar from matching. It's such a hassle that I would use doc.css() with a css selector like a.turnstileLink, which takes care of matching exactly the specified class name in a string that may have multiple class names.
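What the padded contains() expression computes can be written out in plain Python. This is only an equivalent sketch for illustration; the function name has_class_token is made up and is not part of Nokogiri or XPath.

```python
def has_class_token(class_attr: str, name: str) -> bool:
    """Python equivalent of the padded contains() test on @class."""
    # normalize-space(): trim and collapse runs of whitespace to one space.
    normalized = " ".join(class_attr.split())
    # concat(' ', ..., ' ') pads both ends so every class name is delimited
    # by spaces, which turns the substring test into a whole-word test.
    return f" {name} " in f" {normalized} "


print(has_class_token("jobtitle turnstileLink", "turnstileLink"))  # True
print(has_class_token("turnstileLinkerCar", "turnstileLink"))      # False
# A plain substring test, like contains(@class, "turnstileLink"), is fooled:
print("turnstileLink" in "turnstileLinkerCar")                     # True
```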
Try:
doc.xpath('//a[contains(@class, "turnstileLink")]').each{ |e| links << e['href'] }
Or:
doc.css('a.turnstileLink').each{ |e| links << e['href'] }
Here's the problem:
require 'nokogiri'
my_html = %q{
<html>
<body>
A link
B link
C link
D link
</body>
</html>
}
doc = Nokogiri::HTML(my_html)
links = doc.xpath('//a[@class="c1"]').map{ |e| e["href"] }
p links
--output:--
["aaa"]
The class of the bbb link is "c1 c2" which is not equal to "c1".
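The difference between an exact attribute comparison and a class-name test can be reproduced with Python's standard library. The two links below are an assumption reconstructed from the explanation above (one link with class "c1", one with class "c1 c2"); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Two links reconstructed from the explanation (assumed markup).
doc = ET.fromstring(
    '<body>'
    '<a class="c1" href="aaa">A link</a>'
    '<a class="c1 c2" href="bbb">B link</a>'
    '</body>'
)

# [@class='c1'] compares the whole attribute string, so "c1 c2" is not a match.
exact = [a.get("href") for a in doc.findall(".//a[@class='c1']")]

# Splitting on whitespace tests for the class name itself, like a.c1 in CSS.
by_token = [
    a.get("href") for a in doc.iter("a") if "c1" in a.get("class", "").split()
]

print(exact)     # ['aaa']
print(by_token)  # ['aaa', 'bbb']
```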
Response to comment:
require 'nokogiri'
my_html = %q{
<html>
<body>
<div class="x">
A link
B link
C link
<div>
D link
</div>
</div>
<div class="y">
Y link
</div>
</body>
</html>
}
doc = Nokogiri::HTML(my_html)
links = doc.css('a.c1').map{ |e| e["href"] }
p links
--output:--
["aaa", "bbb", "ccc", "ddd", "yyy"]
But:
links = doc.css('div.x a.c1').map{ |e| e["href"] }
p links
--output:--
["aaa", "bbb", "ccc", "ddd"]
The same thing with xpaths:
links = doc.xpath('//div[contains(@class, "x")]//a[contains(@class, "c1")]').map{ |e| e["href"] }
p links
--output:--
["aaa", "bbb", "ccc", "ddd"]
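Scoping the search to one container works the same way outside Nokogiri. Below is a small standard-library Python sketch; the markup is a shortened, assumed version of the example (the original link tags are not shown above), and the names in_x and everywhere are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened, assumed version of the example markup.
doc = ET.fromstring(
    '<body>'
    '<div class="x">'
    '<a class="c1" href="aaa">A link</a>'
    '<div><a class="c1 c2" href="ddd">D link</a></div>'
    '</div>'
    '<div class="y"><a class="c1" href="yyy">Y link</a></div>'
    '</body>'
)


def has_c1(link):
    # Whole-word class test, matching what the CSS selector a.c1 does.
    return "c1" in link.get("class", "").split()


# Only links somewhere below <div class="x">, at any depth.
in_x = [a.get("href") for a in doc.findall(".//div[@class='x']//a") if has_c1(a)]
# Links anywhere in the document.
everywhere = [a.get("href") for a in doc.iter("a") if has_c1(a)]

print(in_x)        # ['aaa', 'ddd']
print(everywhere)  # ['aaa', 'ddd', 'yyy']
```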
I am trying to scrape data from a table in a web page using HtmlAgilityPack.
Below is the relevant portion of the HTML:
<div id="table-matches" style="display: block;"><table class=" table-main"><colgroup><col width="50"><col width="*"><col width="50"><col width="50"><col width="50"><col width="50"><col width="50"></colgroup><tbody><tr class="dark center" xtid="28575"><th class="first2 tl" colspan="3"><a class="bfl" href="/hockey/usa/"><span class="ficon f-200"> </span>USA</a><span class="bflp">»</span>ECHL</th><th>1</th><th>X</th><th>2</th><th xparam="Number of available bookmakers odds~2">B's</th></tr><tr class="odd deactivate" xeid="pn36Jn1f"><td class="table-time datet t1496703900-1-1-0-0 ">04:35</td><td class="name table-participant">South Carolina Stingrays - <span class="bold">Colorado Eagles</span><span class="ico-event-info" onmouseover="toolTip('Colorado Eagles wins series 4-0. 4th leg.', this, event, '4');allowHideTootip(false);delayHideTip(200);return false;" onmouseout="allowHideTootip(true);delayHideTip(200);"> </span></td><td class="center bold table-odds table-score">1:2</td><td class="odds-nowrp" xodd="1.91" xoid="E-2nrdfxv464x0x6av8v">1.91</td><td class="odds-nowrp" xodd="4.74" xoid="E-2nrdfxv498x0x0">4.74</td><td class="odds-nowrp result-ok" xodd="2.79" xoid="E-2nrdfxv464x0x6av90">2.79</td><td class="center info-value">1</td></tr><tr class="dark center" xtid="28308"><th class="first2 tl" colspan="3"><a class="bfl" href="/hockey/usa/"><span class="ficon f-200"> </span>USA</a><span class="bflp">»</span>NHL</th><th>1</th><th>X</th><th>2</th><th xparam="Number of available bookmakers odds~2">B's</th></tr><tr class="odd deactivate" xeid="EyxiHGE4"><td class="table-time datet t1496707200-1-1-0-0 ">05:30</td><td class="name table-participant"><span class="bold">Nashville Predators</span> - Pittsburgh Penguins<span class="ico-event-info" onmouseover="toolTip('Series tied 2-2. 
4th leg.', this, event, '4');allowHideTootip(false);delayHideTip(200);return false;" onmouseout="allowHideTootip(true);delayHideTip(200);"> </span></td><td class="center bold table-odds table-score">4:1</td><td class="odds-nowrp result-ok" xodd="2.15" xoid="E-2ns9hxv464x0x6b2jp">2.15</td><td class="odds-nowrp" xodd="3.86" xoid="E-2ns9hxv498x0x0">3.86</td><td class="odds-nowrp" xodd="2.91" xoid="E-2ns9hxv464x0x6b2jq">2.91</td><td class="center info-value">55</td></tr></tbody></table></div>
I have tried various combinations of properties to access the data within the table, but all I get is the initial node containing the div.
Here is the code I used:
var html = @urlOddsportal;
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
//var html = new HtmlAgilityPack.HtmlDocument();
//html.LoadHtml(new WebClient().DownloadString(urlOddsportal)); // load a string
var root = htmlDoc.DocumentNode;
var node = root.SelectSingleNode("//div[@id='table-matches']"); // this returns non-null
// all of the calls below return null
var rows = node.SelectNodes(".//tr[@class='odd deactivate']");
var table = root.SelectSingleNode("//table[@class=' table-main']");
var tablerows = node.SelectNodes(".//table/tbody/tr[1]");// [@class='odd deactivate']");
var tabletag = htmlDoc.DocumentNode.SelectNodes("//table[@class='table-main']");
Can someone tell me where I am going wrong?
Thanks
Does this return the table?
var table = document.DocumentNode.Descendants("table").FirstOrDefault(_ => _.HasProperty("class", "table-main"));
HasProperty is defined as:
public static bool HasProperty(this HtmlNode node, string property, params string[] valueArray)
{
var propertyValue = node.GetAttributeValue(property, "");
var propertyValues = propertyValue.Split(' ');
return valueArray.All(c => propertyValues.Contains(c));
}
If it works, you can try it on the other nodes that return null.
I prefer this method because it is easier to read than XPath expressions.
I would like to be able to select each item in a list individually. Look at this HTML:
<div class="services">
<a class="service selected" onclick="serviceNameClick('');" href="#">all</a>
<a class="service" onclick="serviceNameClick('12');" href="#">12</a>
<a class="service" onclick="serviceNameClick('14');" href="#">14</a>
<a class="service" onclick="serviceNameClick('14C');" href="#">14C</a>
<a class="service" onclick="serviceNameClick('N14');" href="#">N14</a>
<a class="service" onclick="serviceNameClick('14B');" href="#">14B</a>
<a class="service" onclick="serviceNameClick('27');" href="#">27</a>
<a class="service" onclick="serviceNameClick('12A');" href="#">12A</a>
<a class="service" onclick="serviceNameClick('27C');" href="#">27C</a>
<a class="service" onclick="serviceNameClick('N12');" href="#">N12</a>
<a class="service" onclick="serviceNameClick('14A');" href="#">14A</a>
</div>
I want to display this as a list like:
all 12 14 N14 14B 27 12A 27 N12 14A
I am able to get there using the code below:
string htmlPage = "";
using (var client = new HttpClient())
{
htmlPage = await client.GetStringAsync("http://m.buses.co.uk/stop.aspx?stopid=" + stopIdVariable);
}
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlPage);
List<Movie> movies = new List<Movie>();
foreach (var a in htmlDocument.DocumentNode.SelectNodes("//div[starts-with(@class, 'content')]"))
{
Movie newMovie = new Movie();
//newMovie.Cover = div.SelectSingleNode(".//div[#class='image']//img").Attributes["src"].Value;
// newMovie.Title = div.SelectSingleNode(".//h4[#itemprop='name']").InnerText.Trim();
// newMovie.Summary = div.SelectSingleNode(".//div[#class='outline']").InnerText.Trim();
newMovie.Summary = a.SelectSingleNode("div[starts-with(@class, 'service')]").InnerText.Trim();
movies.Add(newMovie);
}
lstMovies.ItemsSource = movies;
This displays the result in a list, but I am unable to select individual items in it.
I can only select the whole result as one item, not each service separately.
The aim is to select an item and then use its value as text: the user clicks on 12, and I then use that 12 within the app.
What needs to change? Thanks
You can try it this way:
var links = htmlDocument.DocumentNode
.SelectNodes("//div[starts-with(@class, 'content')]//a[@class='service']");
foreach (var a in links)
{
Movie newMovie = new Movie();
newMovie.Summary = a.InnerText.Trim();
movies.Add(newMovie);
}
lstMovies.ItemsSource = movies;
Basically, the XPath passed to SelectNodes() above selects the individual <a> nodes whose class equals "service" (you can change the class check to use starts-with() or contains() if necessary).
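To see why the class check may need loosening, here is a standard-library Python sketch run against a shortened copy of the services markup from the question (the onclick attributes are dropped for brevity). It is an illustration only, not HtmlAgilityPack code, and the variable names are made up.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the services markup from the question.
SNIPPET = """
<div class="services">
  <a class="service selected" href="#">all</a>
  <a class="service" href="#">12</a>
  <a class="service" href="#">14</a>
</div>
"""

root = ET.fromstring(SNIPPET)

# One entry per <a>, so each service becomes its own list item.
# The exact comparison skips the link whose class is "service selected".
exact = [a.text.strip() for a in root.findall(".//a[@class='service']")]

# The starts-with() variant, written in Python, keeps it.
prefixed = [
    a.text.strip()
    for a in root.iter("a")
    if a.get("class", "").startswith("service")
]

print(exact)     # ['12', '14']
print(prefixed)  # ['all', '12', '14']
```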