Obtain text from HTML code with BeautifulSoup

I've been trying to extract the text from the following code with BeautifulSoup in Python:
<a class="w-menu__link" href="https://www.universidadviu.es/grado-economia/">Grado en Economía</a>
I need to extract the text "Grado en Economía" from this and all other similar lines in the html code. For example:
<a class="w-menu__link" href="https://www.universidadviu.es/grado-derecho/">Grado en Derecho</a>
In this line I need to extract "Grado en Derecho".
I can extract the class and the href, but I don't know how to extract the rest of the text. I'm using the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
list_of_links_graus = []
html_graus = urlopen("https://www.universidadviu.es/grados-online-viu/")  # Insert your URL to extract
bsObj_graus = BeautifulSoup(html_graus.read(), "html.parser")
# Collect the href of every anchor on the page
for link in bsObj_graus.find_all('a'):
    list_of_links_graus.append(link.get('href'))
I would also ask someone to please edit the title of this question to fit the real problem, since I'm not an HTML expert and I suspect I'm not simply extracting text (as the title says).
Thanks to all in advance.

Use the text attribute:
for link in bsObj_graus.find_all('a'):
    list_of_links_graus.append((link.get('href'), link.text))
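As a minimal sketch of the answer above, the snippet below parses the two sample anchors from the question from an inline string instead of fetching the live page. The `class_="w-menu__link"` filter and the `sample_html` name are assumptions added for illustration, and `get_text(strip=True)` is used in place of `.text` only to trim surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Inline copy of the two anchors from the question (no network access needed).
sample_html = """
<a class="w-menu__link" href="https://www.universidadviu.es/grado-economia/">Grado en Economía</a>
<a class="w-menu__link" href="https://www.universidadviu.es/grado-derecho/">Grado en Derecho</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect (href, visible text) pairs for the menu links only.
links = [
    (a.get("href"), a.get_text(strip=True))
    for a in soup.find_all("a", class_="w-menu__link")
]

for href, text in links:
    print(text, "->", href)
```

Filtering by class keeps unrelated anchors (navigation, footer and so on) out of the list, which `find_all('a')` on its own would include.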

Related

Extracting full text from HTML span element with XPath expression

I have a HTML tree which looks like this:
<div id="RF4FOEQ3OPBEX" data-hook="review" class="a-section review aok-relative"><div
<div data-hook="review-collapsed" aria-expanded="false" class="a-expander-content reviewText review-text-content a-expander-partial-collapse-content">
<span>
Text line1.
<br>
Text line2.
</span>
I am trying to extract all the text from the span with the following XPath expression:
//div[@data-hook="review"]//div[@data-hook="review-collapsed"]/span/text()
However, this approach only returns the first text line, up to the line break. How should I approach this in order to extract the full text content of the HTML span tag? I would appreciate any help, and thank you in advance for the support.

Use //text() together with the getall() method to get all the text nodes inside a specific element.
getall() returns a list, so just join it:
txt = "".join(response.xpath('//div[@data-hook="review"]//div[@data-hook="review-collapsed"]/span//text()').getall())
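The `<br>` splits the span's content into separate text nodes, so reading only the first node yields only the first line, while collecting every node and joining them gives the full text. The sketch below illustrates that with Python's standard-library `xml.etree.ElementTree` rather than Scrapy, so it is a stand-in and not the answer's API. The markup is a simplified, well-formed copy of the question's snippet (with `<br/>` self-closed), and since ElementTree's limited XPath has no `text()` step, `itertext()` plays the role of `//text()` here.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the markup from the question.
doc = """
<root>
  <div data-hook="review">
    <div data-hook="review-collapsed">
      <span>Text line1.<br/>Text line2.</span>
    </div>
  </div>
</root>
"""

root = ET.fromstring(doc)
span = root.find(".//div[@data-hook='review']//div[@data-hook='review-collapsed']/span")

# Only the first text node, i.e. everything before the <br/>.
first_only = span.text

# Every text node under the span, stripped and joined into one string.
full_text = " ".join(t.strip() for t in span.itertext() if t.strip())

print(first_only)
print(full_text)
```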

Python - Beautifulsoup, differentiate parsed text inside of an html element by using internal tags

So, I'm working on an HTML parser to extract some text data from a list of elements and format it before producing the output. I have a title that I need to set as bold, and a description which I'll leave as it is. I got stuck when I reached this situation:
<div class ="Content">
<Strong>Title:</strong>
description
</div>
As you can see the strings are actually already formatted but I can't seem to find a way to get the tags and the text out together.
What my script does kinda looks like:
article = ""  # this is where I normally store all the formatted text; I need it all as one long string before I output it
temp1 = ""
temp2 = ""
result = soup.findAll("div", {"class": "Content"})
if result is not None:
    x = 0
    for i in result.find("strong"):
        if x == 0:
            temp1 = "<strong>" + i.text + "</strong>"
            article += temp1
            x = 1
        else:
            temp2 = i.nextSibling  # I know this is wrong
            article += temp2
            x = 0
print(article)
It throws an AttributeError, but a misleading one, since the message is "Did you call find_all() when you meant to call find()?".
I also know I can't just use .nextSibling like that, and I'm literally losing it over something that looks so simple to solve...
what I need to get is: "Title: description"
Thanks in advance for any response.
I'm sorry if I haven't explained well what I'm trying to accomplish; it's a bit involved. I need the data to generate a POST request to a CKEditor session so that it adds the text to the HTML page, but the text has to be formatted in a certain way before uploading. In this case I need to get the element inside the tags and format it, then do the same with the description and print them one after the other. For example, a request could look like:
http://server/upload.php?desc=<ul>%0D%0A%09<li><strong>Title%26nbsp%3B<%2strong>description<%2li><%2ul>
So that the result is:
Title1: description
So what I need to do is differentiate between the element inside the tag and the one outside it, using the tag itself as a reference.

EDIT
To select the <strong> element use:
strong = soup.select_one('div.Content strong')
and then, to select its next sibling:
strong.nextSibling
You may need to strip it to get rid of whitespace:
strong.nextSibling.strip()
Just in case
You can use ANSI escape sequences to print something in bold, but I am not sure why you would do that. That is something that should be clarified in your question.
Example
from bs4 import BeautifulSoup
html='''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''
soup = BeautifulSoup(html,'html.parser')
text = soup.find('div', {'class': 'Content'}).get_text(strip=True).split(':')
print('\033[1m'+text[0]+': \033[0m'+ text[1])
Output
Title: description
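Putting the pieces of this answer together, a minimal sketch that rebuilds the `<strong>Title:</strong> description` string the question asks for could look like the following. It assumes every `div.Content` holds one `<strong>` followed directly by a plain text node; the `format_entry` helper name is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="Content">
<Strong>Title:</strong>
description
</div>
"""

def format_entry(div):
    """Return '<strong>title</strong> description' for one div (hypothetical helper)."""
    strong = div.select_one("strong")
    title = strong.get_text(strip=True)
    # next_sibling is the text node that follows the closing </strong> tag.
    description = str(strong.next_sibling or "").strip()
    return f"<strong>{title}</strong> {description}"

soup = BeautifulSoup(html, "html.parser")

# One formatted entry per matching div, concatenated into a single string.
article = "".join(format_entry(div) for div in soup.select("div.Content"))
print(article)
```

Because `select` returns a list of matching divs, each one is processed on its own, which avoids calling `find` on the whole result set as in the question.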

You may want to use htql for this. Example:
text="""<div class ="Content">
<Strong>Title:</strong>
description
</div>"""
import htql
ret = htql.query(text, "<div>{ col1=<strong>:tx; col2=<strong>:xx &trim }")
# ret=[('Title:', 'description')]

Display text as html markup

I have a problem which is probably trivially easy but I can't seem to get it working. Using this post, I do a search using Regex in a text string to convert any links into html markup, but when it comes to display on the page it just displays like this:
this is link
<a href='http://www.google.com'>http://www.google.com</a>
In the view I have:
<p>@news.Body</p>
Edit: great, my question is now displaying how I want. So now to the actual question: how do I get the page to display an actual link to the user instead of the markup?

Use `` around your variable (e.g.)
Use "{}" icon in toolbar to insert code
Indent your code by 4 spaces and put an empty line before it
E.g.:
Like this
You can edit this answer to see raw output

Finding Xpath - for text without HTML tag

<p class="result-price">
<span>Price</span>
$25.00 |
<span>Member Price</span>
$25.00
<span>(0% discount)</span>
</p>
<p class="result-rating">
From the HTML above you can see that "$25.00 |" is plain text that is not wrapped in any HTML tag. I wrote the following XPath to retrieve it:
//div[contains(@data-title,'Rafael B.: Arrangement and Composition')]/div[3]/p[1]/text()[2]
It does match the text, but in the XPath checker the result is displayed inside a container.
When I use the same XPath in my script, it does not retrieve the text value.
Can somebody please help? It looks like the text is inside a container or text area.

Here is a way to retrieve the price:
//span[text()="Price"]/following-sibling::text()[1]
or
//p[@class="result-price"]/span[1]/following-sibling::text()[1]
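The `following-sibling::text()[1]` step picks the text node that sits right after a span. As a rough illustration of that idea, the sketch below uses Python's standard-library `xml.etree.ElementTree`, which stores that node as the element's `.tail`. This is a stand-in and does not evaluate the XPath above (ElementTree has no `following-sibling` axis), and the snippet is a trimmed copy of the question's markup.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the markup from the question.
snippet = """
<p class="result-price">
  <span>Price</span>
  $25.00 |
  <span>Member Price</span>
  $25.00
  <span>(0% discount)</span>
</p>
"""

p = ET.fromstring(snippet)

# Map each span's label to the text node that immediately follows it.
trailing = {span.text: (span.tail or "").strip() for span in p.findall("span")}

print(trailing["Price"])
print(trailing["Member Price"])
```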

Here is the easiest way:
//p[contains(.,'$25.00')]
or you can find any text like this:
//p[contains(.,'%replaceText%')]

Selenium WebDriver cannot extract something like this directly: even though the XPath appears to resolve (for example in FirePath), the result is a text node rather than an element, so findElement raises an InvalidSelectorException. I think the only way to get the text is to read the innerHTML/text of the parent node and then use substring() or split() to extract the required part.
For the HTML snippet above, you can use the following Java code to get "$25.00 |":
String text = driver.findElement(By.xpath("//p[@class='result-price']")).getText().substring(6, 14); // Output will be $25.00 |
System.out.println(text);
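The fixed offsets in `substring(6, 14)` only work while the label and the price keep exactly the same lengths. As a hedged alternative, the sketch below shows the same post-processing idea in Python on a plain string. It assumes the parent element's text has already been fetched and looks like `paragraph_text` below (the exact whitespace Selenium returns is an assumption), and it uses a regular expression instead of fixed positions. `extract_price` is a hypothetical helper name.

```python
import re

# Assumed text of the <p class="result-price"> element after whitespace normalization.
paragraph_text = "Price $25.00 | Member Price $25.00 (0% discount)"

def extract_price(text, label):
    """Return the first dollar amount that follows `label`, or None if absent."""
    pattern = re.escape(label) + r"\s+(\$\d+(?:\.\d+)?)"
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Same slice as the Java substring(6, 14) call, for comparison.
fixed_slice = paragraph_text[6:14]

price = extract_price(paragraph_text, "Price")
member_price = extract_price(paragraph_text, "Member Price")

print(fixed_slice)
print(price, member_price)
```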

Get text not contained in tags htmlagilitypack vb.net

I would simply like to know how to get text from websites (with VB.NET) that is not between HTML tags (that is, not like <b> hello </b>). I know that I can use HtmlAgilityPack, and I have been reading and reading, but I haven't found anything that helps.
I need to extract text like this:
<td colspan="2">
<b>Some text:</b>
I need to extract this text
Some text
Some text 2
...
This is from a website that requires login first, so I can't give you the URL. I need to extract this from a live website, not from an HTML file or a string variable.
