HTML: How to refer to span.title inside a class? - html

I am building a webscraper and I have this block of HTML code:
<div class = 'example-1'>
<ul class = 'example-2'>
<li>
<span title = 'data1' > 155 </span>
/
<span title = 'data2' > 155 </span>
And I want to scrape the numbers 155 and 145 inside the span title
In my code using scrapy, I identified this as:
'size': detail.css('ul.example-2 ::text').get(),
but it is not returning me anything. How do I fix this?

The correct CSS selectors are:
span[title="data1"]
span[title="data2"]
Alternatively, you can select both at the same time with:
span[title^="data"]
I am unfamiliar with scrapy syntax, but I believe your scrapy selector should look something like this:
response.css('span[title^="data"]::text').getall()
Further info:
In CSS, square brackets denote an attribute selector.
You can select:
an element with an attribute : span[title]
an element with a specific attribute-value : span[title="data1"]
an element with the start pattern of an attribute-value : span[title^="data"]
an element with the end pattern of an attribute-value : span[title$="1"]
and more.
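The four attribute patterns above can be tried without Scrapy. The following is a minimal sketch, assuming a well-formed version of the question's snippet (the closing tags are added here) and using Python's standard-library ElementTree instead of Scrapy's CSS engine. ElementTree has no prefix or suffix operator, so those two cases are filtered in plain Python.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the snippet from the question (closing tags are an assumption).
SNIPPET = """
<div class="example-1">
  <ul class="example-2">
    <li>
      <span title="data1"> 155 </span>
      /
      <span title="data2"> 155 </span>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(SNIPPET)

# span[title]: any span that has a title attribute.
with_title = root.findall(".//span[@title]")

# span[title="data1"]: exact attribute value.
exact = root.findall(".//span[@title='data1']")

# span[title^="data"]: no prefix operator in ElementTree, so filter in Python.
prefixed = [s for s in with_title if s.get("title", "").startswith("data")]

# span[title$="1"]: same idea for a suffix match.
suffixed = [s for s in with_title if s.get("title", "").endswith("1")]

# Strip the padding around each number, as ::text would return it unstripped.
values = [s.text.strip() for s in prefixed]
print(values)
```

In Scrapy itself the selector from the answer does this in one step; the sketch only shows which elements each pattern matches.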

Related

Built In Method in Typescript to Find Class inside a Tag Name or a Sibling of a Class

Here's the sample html:
<div aria-atomic="true" aria-live="polite" class="sr-only">
</div>
<sl-render ng-reflect-image-url="xxx">
<div class="imageBackground" tabindex="0">
<button class="OpenButton" tabindex="0">Open</button>
</div>
<div aria-atomic="true" aria-live="polite" class="sr-only">
</div>
</sl-render>
As you can see, there are multiple sr-only elements in the example, but I only want the sr-only that's inside the sl-render tag. Here's a not-so-clean solution:
1. query document.getElementsByTagName('sl-render')
2. based on 1, query another document.getElementsByClassName('.sr-only') since sr-only is inside sl-render tag
I am looking for a cleaner solution than the above, perhaps a built-in function to find sr-only class that is below imageBackground class?
Use querySelector or querySelectorAll with the selector sl-render > div.sr-only.
sl-render > div.sr-only will select every <div class="sr-only"> that is an immediate child of an <sl-render> element.
TypeScript cannot infer the element type from a compound selector string, so querySelector returns a plain Element | null here; this is one of the situations where using as is fine:
querySelector's return-type should be refined to HTMLElement | null or a subtype thereof (e.g. HTMLDivElement | null).
querySelectorAll's return-type should be refined to NodeListOf<HTMLElement> (there is no need for the type-union with | null as it's a collection-type).
Like so:
const srOnlyDivsInSLRenderElements = document.querySelectorAll( 'sl-render > div.sr-only' ) as NodeListOf<HTMLDivElement>;
for( const div of srOnlyDivsInSLRenderElements ) {
    console.log( div.outerHTML );
}

how to get content within a span tag

#Example 1
<span class="levelone">
<span class="leveltwo" dir="auto">
::before
"Blue"
::after
</span>
</span>
#Example 2
<div class="itemlist">
<div dir="auto" style="text-align: start;">
"mobile"
</div>
</div>
#Example 3
<div class="quantity">
<div class="color">...</div>
<span class="num">10</span>
</div>
Hi, I am trying to use selenium to extract content from html. I managed to extract the content for example 1 & 2, the code that I have used is
example1 = driver.find_elements_by_css_selector("span[class='leveltwo']")
example2 = driver.find_elements_by_css_selector("div[class='itemlist']")
and printed out as text with
data = [dt.text for dt in example1]
print(data)
I got "Blue" for example 1 & "mobile" for example 2. For simplicity purposes, the html given above is for one iteration, I have scraped all elements with the class mentioned above
However, for the 3rd example, I tried to use
example3a = driver.find_elements_by_css_selector("div[class='quantity']")
and
example3b = driver.find_elements_by_css_selector("div[class='num']")
and
example3c = driver.find_element_by_class_name("num")
but all of them returned an empty list. I'm not sure whether that is because there is no dir in example 3. What method should I use to extract the "10"?
For the 3rd example, you can try the CSS selector below:
div.quantity span.num
In code you can write it like this (note find_element, singular, which returns one element rather than a list, so .text is available on it):
example3a = driver.find_element_by_css_selector("div.quantity span.num")
print(example3a.text)
or
print(example3a.get_attribute('innerHTML'))
To extract specifically the 10 you can use
example3a = driver.find_elements_by_css_selector("div.quantity span.num")
To extract both elements inside <div class="quantity"> you can use
example3 = driver.find_elements_by_xpath("//div[@class='quantity']//*")
for el in example3:
    print(el.text)
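The difference between a single-element lookup and a list lookup is easy to trip over, so here is a minimal sketch of both against the example 3 markup. It uses Python's standard-library ElementTree as a stand-in for the Selenium driver (an assumption made so that it runs without a browser): find plays the role of find_element, and iterating the descendants plays the role of the //* XPath.

```python
import xml.etree.ElementTree as ET

# Example 3 from the question, parsed without a browser.
SNIPPET = """
<div class="quantity">
  <div class="color">...</div>
  <span class="num">10</span>
</div>
"""

root = ET.fromstring(SNIPPET)

# Single lookup, like "div.quantity span.num": one element or None.
num = root.find(".//span[@class='num']")
quantity = num.text.strip() if num is not None and num.text else None

# List lookup, like "//div[@class='quantity']//*": every descendant element.
descendant_texts = [
    (el.tag, (el.text or "").strip())
    for el in root.iter()
    if el is not root  # iter() yields the root itself first, so skip it
]

print(quantity)
print(descendant_texts)
```

A single lookup gives back an object with a text value, while a list lookup has to be looped over before any text can be read.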

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (span's) nested under other elements (div's and span's). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children
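The distinction between all child nodes and element children is not specific to Nokogiri. As an illustration only, here is a minimal sketch of the same idea in Python's standard-library minidom, assuming one well-formed itemize span from the question (the closing span tag is added). childNodes behaves like children, and filtering on the node type gives the equivalent of element_children.

```python
from xml.dom import minidom

# One itemize block from the question, with the missing </span> added.
SNIPPET = """<span class="itemize">
    <div class="r12321">Plains</div>
    <div class="as124223">Trains</div>
    <div class="qwss12311232">Automobiles</div>
</span>"""

span = minidom.parseString(SNIPPET).documentElement

# childNodes includes the whitespace-only text nodes between the divs.
all_children = list(span.childNodes)

# Keeping only element nodes mirrors what element_children returns.
element_children = [n for n in all_children if n.nodeType == n.ELEMENT_NODE]

# Map each div's class to its text, as the question's hash does.
item_hash = {el.getAttribute("class"): el.firstChild.data for el in element_children}

print(len(all_children), len(element_children))
print(item_hash)
```

The whitespace nodes account for the nil keys in the question's output, since a text node has no class attribute.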

How can I get list of elements or data which are on same level with same attributes?

I have a web application with a single HTML page.
The page structure is like this:
<div class = 'abc'>
<div class = 'pqr'>test1</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>test2</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
Here I want to take the data from test1 through test2.
I have tried XPath with a [node number] index, but every node turned out to be at position [1] within its parent.
Is there any way to get all the data, or a list of elements, from test1 to test2 including the "-" entries?
I have seen this kind of issue before.
You have to use following-sibling here.
First I use this type of xpath, which selects the "-" entries that follow test1:
//div[text()='test1']/../following-sibling::div/div[@class='pqr' and not(contains(text(),'test'))]
Then you need to change your script (note: I wrote my code in Java).
Logic :
while(element found text = '-')
{
//get data here
}
Please try this approach.
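The loop logic above can be sketched outside Selenium as well. This is a minimal illustration, not the answerer's Java code: it parses the question's markup with Python's standard-library ElementTree, wraps it in a hypothetical root element so it forms one document, and uses a hypothetical helper named collect_between to gather every pqr value from the start marker through the end marker.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <root> so it parses as one document.
SNIPPET = """
<root>
  <div class="abc"><div class="pqr">test1</div></div>
  <div class="abc"><div class="pqr">-</div></div>
  <div class="abc"><div class="pqr">-</div></div>
  <div class="abc"><div class="pqr">test2</div></div>
  <div class="abc"><div class="pqr">-</div></div>
</root>
"""

def collect_between(root, start, end):
    """Return the pqr texts from the `start` entry through the `end` entry."""
    collected = []
    inside = False
    # findall returns matches in document order.
    for div in root.findall(".//div[@class='pqr']"):
        text = (div.text or "").strip()
        if text == start:
            inside = True
        if inside:
            collected.append(text)
        if inside and text == end:
            break  # stop before the trailing "-" after the end marker
    return collected

values = collect_between(ET.fromstring(SNIPPET), "test1", "test2")
print(values)
```

Walking the elements in document order avoids relying on sibling positions, which are all [1] in this structure.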
I guess you want the following xpath :
(//div[@class='pqr'])[position()<=4]
Notice the parentheses () around the path before the position() predicate.
output in xpath tester :
Element='<div class="pqr">test1</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">test2</div>'
I think you can't use the test1 and test2 elements as identifiers because they are on the same level as the nodes you want to collect. Otherwise, I think you can use findElements(By.xpath("pattern_to_search")), which will return a collection of the elements that match your pattern.
One more way without using xpath:
List<WebElement> element = driver.findElements(By.className("pqr"));
// size()-1 skips the last element, i.e. the trailing "-" after test2
for(int i=0;i<element.size()-1;i++){
    System.out.println(element.get(i).getText());
}

Selenium WebDriver how to verify Text from Span Tag

I'm trying to verify the text in the span by using WebDriver. Here is the span tag:
<span class="value">
/Company Home/IRP/tranzycja
</span>
I tried something like this:
driver.findElement(By.xpath("//span[@id='/Company Home/IRP/tranzycja']'"));
driver.findElement(By.cssSelector("span./Company Home/IRP/tranzycja"));
but neither of these works.
Any help would be really appreciated. Thanks
More code:
<span id="uniqName_64_0" class="alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small" data-dojo-attach-point="renderedValueNode" widgetid="uniqName_64_0">
<span class="inner" tabindex="0" data-dojo-attach-event="ondijitclick:onLinkClick">
<span class="label">
In folder:
</span>
<span class="value">
/Company Home/IRP/tranzycja
</span>
</span>
uniqName shouldn't be used as a target because there are a lot of them and they change.
Here is the full HTML code:
http://www.filedropper.com/spantag
Here I am assuming you are trying to verify the text in the span tag,
i.e. '/Company Home/IRP/tranzycja'.
Try the code below:
String expected_String = "/Company Home/IRP/tranzycja";
String actual_String = driver.findElement(By.xpath("//span[@class='alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small']//span[@class='value']")).getText();
if(expected_String.equals(actual_String))
{
    System.out.println("Text is Matched");
}
else
{
    System.out.println("Text is not Matched");
}
You can try using xpath ('some text' can be replaced by a variable, like @Rupesh suggested):
driver.findElement(By.xpath("//span/span[@class='value'][normalize-space(.) = 'some text']"))
or
driver.findElement(By.xpath("//span/span[@class='value'][contains(text(),'some text')]"))
(Be aware that findElement returns only the first matching element, so if there are span elements with the text 'some text 1' and 'some text 2', only the first occurrence will be found.)
Of course, both calls will throw NoSuchElementException if no element with the given text is found on the page. If you're using Java, you can easily catch that exception and print a proper message.
One possible xpath to find that <span> element :
//span[normalize-space(.) = '/Company Home/IRP/tranzycja']
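The reason normalize-space matters here is that the span's text starts and ends with line breaks and indentation, so a raw comparison against the expected string fails. The following is a minimal sketch of that effect, using Python's standard-library ElementTree on a trimmed copy of the question's markup. The helper normalize_space is hypothetical and only approximates the XPath function (Python's split treats a few more characters as whitespace).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question.
SNIPPET = """
<span class="inner">
  <span class="label">
    In folder:
  </span>
  <span class="value">
    /Company Home/IRP/tranzycja
  </span>
</span>
"""

def normalize_space(text):
    """Trim the ends and collapse whitespace runs, roughly like XPath normalize-space()."""
    return " ".join((text or "").split())

expected = "/Company Home/IRP/tranzycja"
value_span = ET.fromstring(SNIPPET).find(".//span[@class='value']")

# The raw text still carries the surrounding newlines and indentation.
raw_matches = value_span.text == expected

# After normalization only the single space inside the path remains.
normalized_matches = normalize_space(value_span.text) == expected

print(raw_matches, normalized_matches)
```

Selenium's getText() already trims leading and trailing whitespace, which is why comparing its result in Java works even though an exact text() match in XPath may not.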
I think you're going to want to use something like
driver.findElement(By.xpath("//span[@class='value']")).getText();
The getText() call will get the text within that span.
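You can use the text() function inside XPath. Because the span's text is surrounded by whitespace, match it with contains() rather than an exact comparison. I hope this will resolve your problem.
String str1 = driver.findElement(By.xpath("//span[contains(text(),'/Company Home/IRP/tranzycja')]")).getText();
System.out.println(str1);
Output = /Company Home/IRP/tranzycja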