How to get simple text from an HTML page with goquery?

I am new to Go. I am using goquery to extract data from an HTML page.
The problem is that the data I am looking for is not wrapped in any HTML tag of its own. It is plain text after a <br> tag. How can I extract it?
Edit: Here is the HTML code.
<div class="container">
<div class="row">
<div class="col-lg-8">
<p align="justify"><b>Name</b>Priyanka</p>
<p align="justify"><b>Surname</b>Patil</p>
<p align="justify"><b>Adress</b><br>India,Kolhapur</p>
<p align="justify"><b>Hobbies </b><br>Playing</p>
<p align="justify"><b>Eduction</b><br>12th</p>
<p align="justify"><b>School</b><br>New Highschool</p>
</div>
</div>
</div>
From this I want "Priyanka" and "12th".

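The following should do what you want:
doc.Find(".container").Find("[align=\"justify\"]").Each(func(_ int, s *goquery.Selection) {
	// The label is the text of the <b> element, e.g. "Name".
	prefix := s.Find("b").Text()
	// s.Text() is the label followed by the value, so strip the label to keep the value.
	result := strings.TrimPrefix(s.Text(), prefix)
	fmt.Println(result)
})
Import fmt and strings at the top of your file. If you need a complete code example, check here.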

Try querying for the preceding element and getting its siblings:
http://godoc.org/github.com/PuerkitoBio/goquery#Selection.Siblings

Related

How to get a div or span class from a related span class?

I've found the lowest class: <span class="pill css-1a10nyx e1pqc3131"> of multiple elements of a website but now I want to find the related/linked upper-class so for example the highest <div class="css-1v73czv eh8fd9011" xpath="1">. I've got the soup but can't figure out a way to get from the 'lowest' class to the 'highest' class, any idea?
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">
<div class="css-1rkuvma eh8fd908">
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">
End result would be:
INPUT - Search all elements of a page with class <span class="pill css-1a10nyx e1pqc3131"> (lowest)
OUTPUT - Get all related titles or headers of said class.
I've tried it with if-statements but that doesn't work consistently. Something with an if class = (searchable class) then get (desired higher class) should work.
I can add any more details if needed please let me know, thanks in advance!
EDIT: Picture for clarification, where the title (highest class) = "Wooferland Festival 2022" and the number (lowest class) = 253
As mentioned, the question needs some more information before a concrete answer can be given.
Assuming you want to scrape the information in the picture, and based on your example HTML, you can select your pill and use .find_previous() to locate the other elements:
for e in soup.select('span.pill'):
print(e.find_previous('header').text)
print(e.find_previous('div').text)
print(e.text)
Assuming there is a container tag in the HTML structure, such as <a> or another element, you could instead select it on the condition that it contains a <span> with class pill:
for e in soup.select('a:has(span.pill)'):
print(e.header.text)
print(e.header.next.text)
print(e.footer.span.text)
Note: Instead of relying on CSS classes, which can be highly dynamic, try to use more static attributes or the HTML structure.
Example
Both options are shown; for the first one, the <a> does not matter.
from bs4 import BeautifulSoup
html='''
<a>
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
</footer>
</div>
</div>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('span.pill'):
print(e.find_previous('header').text)
print(e.find_previous('div').text)
print(e.text)
print('---------')
for e in soup.select('a:has(span.pill)'):
print(e.header.text)
print(e.header.next.text)
print(e.footer.span.text)
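Output (the second block prints the date twice because e.header.next is the header's own text node; e.header.find_next_sibling('div').text would give the title instead)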
some date information
some title
some number
---------
some date information
some date information
some number
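
If the container tag is not known in advance, the same idea (start at the pill, then climb until an ancestor also holds the header) can be sketched as below. With BeautifulSoup the climbing step would be e.find_parent(...); this sketch deliberately swaps in the standard library's xml.etree.ElementTree so that it has no dependencies, which means it assumes the markup is well-formed XML, and real pages often are not. SAMPLE and rows_from_pills are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the sample markup from the answer above.
SAMPLE = """<a>
<div class="css-1v73czv eh8fd9011">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
</footer>
</div>
</div>
</a>"""


def rows_from_pills(markup):
    """Return (header, title, pill) text triples, one per span with class 'pill'."""
    root = ET.fromstring(markup)
    # ElementTree has no parent links, so build a child -> parent lookup once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for span in root.iter("span"):
        if "pill" not in span.get("class", "").split():
            continue
        # Climb from the pill until an ancestor has a <header> as a direct child.
        node = parent_of.get(span)
        while node is not None and node.find("header") is None:
            node = parent_of.get(node)
        if node is None:
            continue  # no enclosing card found for this pill
        header = node.find("header")
        title = node.find("div")  # first direct-child <div> of the card
        rows.append((header.text, title.text if title is not None else None, span.text))
    return rows


print(rows_from_pills(SAMPLE))
```

Matching on the structure (the nearest ancestor that holds a header) rather than on the generated class names follows the note above about dynamic CSS classes.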

Traversing the DOM with querySelector

I'm using the statement document.querySelector("[data-testid='people-menu'] div:nth-child(4)") in the console to give me the below HTML snippet:
<div>
<span class="jss1">
<div class="jss2">
<p class="jss3">Owner</p>
</div>
</span>
<div class="jss4">
<div class="5" title="User Title">
<p class="jss6">UT</p>
</div>
<div class="jss7">
<p class="jss82">User Title</p>
<span class="jss9">Project Manager</span>
</div>
</div>
</div>
I'd like to extend the statement in the console to extract the title "User Title" but can't figure out what combination of nth-child or nextSibling (or something else) to use. The closest I've gotten is:
document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1)")
which gives me the span with class jss1.
I expected document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1).nextSibling") to give me the div with class jss4, but it returns null.
I can't use class selectors because those are generated dynamically at build.
Why not just add [title] onto your querySelector?
document.querySelector("[data-testid='people-menu'] div:nth-child(4) [title]")
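You can then read the value from the matched element, for example with .getAttribute('title'). This assumes only one element in that section of the HTML has a title attribute. As for why your attempt returned null: nextSibling is a DOM property, not selector syntax, so inside the selector string ".nextSibling" is parsed as a class selector and nothing matches.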

jQuery - Insert text inside element of a HTML string

I store an html string into var HTML, which I get using the following:
var HTML = $('.group').get(0).outerHTML;
The output of HTML using console.log(HTML) is:
<div class="group">
<div class="class1">
Data123...
</div>
<div class="class2">
<!--I want to insert text here -->
</div>
</div>
Now, I want to insert some text inside the div class="class2". I am using the following code:
$(HTML).find('.class2').text("Hello!");
But now the output of HTML using console.log(HTML) is the same old HTML as before. The text "Hello!" did not get inserted. Can anyone help with the solution.
Here is the complete code:
<div class="group">
<div class="class1">
Data123...
</div>
<div class="class2">
</div>
</div>
<script type="text/javascript">
var HTML = $('.group').get(0).outerHTML;
$(HTML).find('.class2').text("Hello!");
console.log(HTML);
</script>
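You're updating a temporary DOM element, but that doesn't change the HTML string. You need to keep the DOM elements in a variable, modify them, and then serialize them again.
var new_div = $(HTML);                   // parse the string into detached elements
new_div.find('.class2').text("Hello!");  // modify the detached copy
HTML = new_div.prop('outerHTML');        // serialize back, including the outer .group div
console.log(HTML);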

Why don't the String values get displayed as HTML tags using innerHTML?

I load string data from my database into a variable and want to render it as HTML with [innerHTML], but it doesn't work.
The value is displayed as a plain string instead of as HTML tags.
I tried using DomSanitizer, but it doesn't work either:
article:Article[];
(article.articlesTitleHtml:SafeHtml;)
in the function:
this.article.forEach(elementArticle => {
elementArticle.articlesTitleHtml = this.sanitizer.bypassSecurityTrustHtml(elementArticle.articleTitle)
});
in HTML page:
<div *ngFor="let item of articles">
<div id="{{item.articleId}}">
<h2 class="chart" [innerHTML]="item.articlesTitleHtml"></h2>
</div>
my code:
in TypeScript:
articles:Article[];
ngOnInit() {
  this.apiArticle.getArticleList().subscribe(data => {
    this.articles = data;
  });
}
in HTML page:
<div *ngFor="let item of articles">
<div id="{{item.articleId}}">
<h2 class="chart" [innerHTML]="item.articleTitle"></h2>
</div>
</div>
It should work; you can check here...
If you can share the kind of data you're dealing with, that will give more insight into which DomSanitizer method should be called.
In my example above, I used both bypassSecurityTrustHtml and bypassSecurityTrustUrl for the two different types of strings that needed sanitization.

Display HTML code in text box

I have a text box that I fill from a JSON response, as below:
<div class="gadget-body" style="height:100px">
<label>{{ textData}}</label>
</div>
But now my JSON returns HTML code with <p> and <h1> tags. I am binding the response, but the tags are displayed literally instead of being rendered.
The simplest way is to use the [innerHTML] property binding:
<div class="gadget-body" >
<div [innerHTML]="textData">
</div>
</div>
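If you want plain text instead, you could have a function like this:
function htmlToPlaintext(text) {
  // Remove anything that looks like a tag; empty or missing input becomes ''.
  return text ? String(text).replace(/<[^>]+>/gm, '') : '';
}
and then you'd use: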
<div class="gadget-body" style="height:100px">
<label>{{ htmlToPlaintext(textData) }}</label>
</div>
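
The regex above removes tags but leaves character references such as &amp; as they are. If the stripping can happen before the data reaches the template, for example on a Python backend (an assumption; the question does not say what produces the JSON), a parser-based version is one possible alternative. This is a minimal sketch using the standard library's html.parser; html_to_plaintext and _TextCollector are hypothetical names.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text content of the HTML it is fed."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plaintext(text):
    """Return the text of an HTML fragment with tags removed and entities decoded."""
    if not text:
        return ""
    collector = _TextCollector()
    collector.feed(str(text))
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


print(html_to_plaintext("<h1>Title</h1><p>Fish &amp; chips</p>"))
```

Note that this keeps the text of every element, including script or style blocks if the input contains them, so it is only a starting point.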