How can I extract information from an HTML page using Python?

Based on the HTML page below, I am looking to extract this information about this property:
1- Number of bathrooms
2- Living Area
3- Energy Rating
4- Description
<div class="bloco-imovel-resumo-dados">
<div id="Cpl_modulodadosresumidos_module_holder" class="modulo-dados-resumidos">
<h2 class="lbl_descricao_dados">Property Information</h2>
<ul class="bloco-dados">
<li>
<b>Condition:</b> <span>Renewed</span></li>
<li>
<b>Living Area:</b><span> 80 m<sup>2</sup></span></li>
<li>
<b>Total Area:</b><span> 0 m<sup>2</sup></span></li>
<li>
<b>Bathrooms:</b><span> 1 </span></li>
<li>
<b>Bedrooms:</b><span> 2 </span></li>
<li>
<b>Energy Rating:</b><span> C</span></li>
</ul>
<div class="bloco-imovel-texto">
<h3 class="lbl_description">
Description </h3>
<p>At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident.Nam libero tempore, omnis dolor repellendus.</p>
</div>
I tried to extract the number of bathrooms with the code below, but I received this error: "AttributeError: 'HtmlElement' object has no attribute 'find_element_by_css_selector'"
from lxml import html,etree
with open(r'listing.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
Bathrooms = tree.find_element_by_css_selector('Bathrooms')
print('Bathrooms: {}'.format(tree.cssselect(Bathrooms)[0].text))
I am a beginner at HTML and CSS so I need your help.

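import lxml.html
with open(r'listing.html', "r") as f:
    page = f.read()
# fromstring() parses a string of markup; parse() expects a file name or file object instead
root = lxml.html.fromstring(page)
# XPath tests attributes with @ (not #); the data list in the question is <ul class="bloco-dados">
object_list = root.xpath(".//ul[@class='bloco-dados']/li[b='Bathrooms:']/span")
bathrooms = object_list[0]
text = bathrooms.text_content().strip()
print(text)
Try this; it should print the number of bathrooms.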
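To pull all four fields from the question in one pass, here is a minimal sketch that uses only the standard library's html.parser as an alternative to lxml. The names PropertyParser, extract_fields and SAMPLE are made up for illustration, and the sample markup is a trimmed copy of the question's HTML. It assumes every value sits in a <span> directly after its <b> label and that the description is the only <p> on the page.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (illustrative only).
SAMPLE = """
<ul class="bloco-dados">
<li><b>Living Area:</b><span> 80 m<sup>2</sup></span></li>
<li><b>Bathrooms:</b><span> 1 </span></li>
<li><b>Energy Rating:</b><span> C</span></li>
</ul>
<div class="bloco-imovel-texto">
<h3 class="lbl_description">Description </h3>
<p>At vero eos et accusamus et iusto odio.</p>
</div>
"""


class PropertyParser(HTMLParser):
    """Collects <b>label</b><span>value</span> pairs and the <p> description."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # tag whose text is currently being collected
        self._label = None    # most recent <b> label waiting for its value
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text only when not already inside a captured tag,
        # so a nested <sup> inside <span> stays part of the span's text.
        if self._capture is None and tag in ("b", "span", "p"):
            self._capture = tag
            self._buf = []

    def handle_data(self, data):
        if self._capture is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "b":
            self._label = text.rstrip(":")
        elif tag == "span" and self._label:
            self.fields[self._label] = text
            self._label = None
        elif tag == "p":
            self.fields["Description"] = text
        self._capture = None


def extract_fields(markup):
    parser = PropertyParser()
    parser.feed(markup)
    parser.close()
    return parser.fields


for name, value in extract_fields(SAMPLE).items():
    print(f"{name}: {value}")
```

With a real file, the string read from listing.html would be passed to extract_fields in place of SAMPLE.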

Related

Set different styles in a list

I'm creating a FAQ. I have an array with the questions and answers, and I managed to render it on my page, but I want to apply different styles to the questions and the answers. How can I do this? Here is my code:
My FAQ list:
const FAQ = ({ props }) => {
const history = useHistory();
const classes = useStyles();
const dispatch = useDispatch();
const { user, userlist, isLoading } = useSelector((state) => state.Authentication);
const [expanded, setExpanded] = React.useState(false);
const handleChange = (panel) => (event, isExpanded) => {
setExpanded(isExpanded ? panel : false);
};
const WellbeingData = [
{
index: 1,
question: "Lorem ipsum dolor sit amet?",
answer:
"Tenetur ullam rerum ad iusto possimus sequi mollitia dolore sunt quam praesentium. Tenetur ullam rerum ad iusto possimus sequi mollitia dolore sunt quam praesentium.Tenetur ullam rerum ad iusto possimus sequi mollitia dolore sunt quam praesentium.",
},
{
index: 2,
question: "Dignissimos sequi architecto?",
answer:
"Aperiam ab atque incidunt dolores ullam est, earum ipsa recusandae velit cumque. Aperiam ab atque incidunt dolores ullam est, earum ipsa recusandae velit cumque.",
},
{
index: 3,
question: "Voluptas praesentium facere?",
answer:
"Blanditiis aliquid adipisci quisquam reiciendis voluptates itaque.",
},
];
And here is my last failed attempt to render it correctly:
const questionList = WellbeingData.map((data) => <div style={{fontSize: 18, fontWeight: 'bold'}}><li key={data.index}>{data.question}</li></div>)
const answerList = WellbeingData.map((data) => <div style={{fontSize: 16}}><li key={data.index}>{data.answer}</li></div>)
{questionList}
<br/>
{answerList}
What could I do to solve this?

If the styles are limited, let's say "Answered", "New" and "Deleted", you can have one CSS class for each style (e.g. .deletedAnswer, .deletedQuestion) and add a new property to each object, such as type: something, then dynamically choose a class based on the value of type. The easiest way is to set the value to the class name:
{
index: 3,
question: "Voluptas praesentium facere?",
answer:
"Blanditiis aliquid adipisci quisquam reiciendis voluptates itaque.",
class: "deletedAnswer"
}
Then assign the value of array[index].class to the element's className.

How to get the text of a section on an HTML page

I have a webpage with this section:
<section class="post__content">
At vero eos et accusamus et iusto odio dignissimos ducimus qui
blanditiis praesentium voluptatum deleniti atque corrupti quos
dolores et quas molestias excepturi sint occaecati cupiditate
non provident.
</section>
I need to get the text from that section.
I tried:
https://www.example.com/quote-of-the-day?d=01/09/2021/#post__content

IF the text is on the same PAGE as the script
console.log(
  document.querySelector("section.post__content").textContent
    .replace(/(\r\n|\n|\r)/gm, "") // strips every line break; remove this call if you need them
)
<section class="post__content">
At vero eos et accusamus et iusto odio dignissimos ducimus qui
blanditiis praesentium voluptatum deleniti atque corrupti quos
dolores et quas molestias excepturi sint occaecati cupiditate
non provident.
</section>
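IF the text is on ANOTHER page on the same server as the page where you want to grab it, you can use jQuery or fetch. With jQuery you can request the page, parse the response, and grab just the section. Note that the "url selector" shorthand only works with .load(), not with $.get(), so the selector is applied to the response here:
$.get("/quote-of-the-day?d=01/09/2021/", function(data) { console.log($("<div>").html(data).find("section.post__content").text()) })
Please note that a class is selected with .post__content, while #post__content would select an ID.
If the page is NOT on the same server, you need a server-side process to fetch it.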
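As a rough idea of what such a server-side process could do once it has the page's HTML as a string, here is a minimal Python sketch using only the standard library. SectionText, section_text and PAGE are hypothetical names for illustration. Fetching the page (for example with urllib.request.urlopen) is left out so the sketch runs offline.

```python
from html.parser import HTMLParser

# Stand-in for the downloaded page (illustrative markup only).
PAGE = """<html><body><h1>Quote of the day</h1>
<section class="post__content">
At vero eos et accusamus et iusto odio dignissimos ducimus qui
blanditiis praesentium voluptatum.
</section>
<footer>Other text</footer></body></html>"""


class SectionText(HTMLParser):
    """Collects the text inside <section> elements carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0     # > 0 while inside a matching section
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "section":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested sections so the matching end tag is found correctly.
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "section" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def section_text(markup, css_class="post__content"):
    parser = SectionText(css_class)
    parser.feed(markup)
    parser.close()
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join("".join(parser.chunks).split())


print(section_text(PAGE))
```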

Try this:
const sectionText = document.querySelector("section.post__content").textContent
console.log(sectionText)
You can even create a function that gets an element's text:
const getElementText = selector => {
  return document.querySelector(selector).textContent
}
const sectionText = getElementText("section.post__content")
console.log(sectionText)
NOTE: If more than one section element has the post__content class, use document.querySelectorAll(), which returns a NodeList of all matching elements.

Correct HTML for multi-paragraph quotes

I have text written like so:
"Something something," he said. "Lorem ipsum dolor sit amet, ne natum
elitr his, usu ex dictas everti, utamur reformidans ad vis. Eam
dissentiet deterruisset an, vis cu nullam lobortis. Doming inimicus eu
nec, laudem audire ceteros cu vis, et per eligendi splendide. Ne
legere tacimates instructior qui. Te vis dicat iudico integre, ex est
prima constituam consequuntur. Vix sanctus voluptaria ei, usu ornatus
iracundia ne, nam nulla iudico no. Duo ei labores nusquam.
"In harum docendi fuisset vis. Meis constituam ea quo. Ei vim prima
liber officiis. Ad modo tota augue est, fugit soleat blandit eos ex."
The text follows an annoying typographical rule for multi-paragraph quotes: a quote which spans several paragraphs doesn't have an end quote at the end of each internal paragraph, but it has an extra opening quote at the start of each following paragraph. HTML5 doesn't allow <p> elements inside <q> elements, but this situation is worse: the second quote doesn't even include all of the paragraph, so even if it did (or if I used, say, <div class=quote>) I can't see a way to mark this up without mis-aligned elements.
(Of course I could just embed quotation marks rather than use <q>, but I'm looking for thoughts on the best way to do this.)

The previous answer (redefining quotes: '\201c' '\201d' '\2018' '\2019';) works, but it overrides the user's locale and forces American style quotation marks.
Instead, this:
<style>
q.continued::after { font-size: 0; }
</style>
…
<p>
I replied, <q class="continued">Blah blah blah.</q>
</p>
<p>
<q class="continued">And not only that!</q>
</p>
<p>
<q>So there!</q>, and left the room.
</p>
Produces:
I replied, ”Blah blah blah.
”And not only that!
”So there!”, and left the room.
Which should work correctly with any locale's quotation characters.
Note that display:none would be better, but it doesn't work.
The browser (Chrome at least) notices that the closing quote is missing and switches to the alternate quote mark for the following paragraphs.

The cleanest way to do this is indeed with quotation marks; they are just as semantically appropriate as using the <q> element. From the spec:
The use of q elements to mark up quotations is entirely optional; using explicit quotation punctuation without q elements is just as correct.
Code Example:
In the following example, quotation marks are used instead of the q element:
<p>His best argument was ❝I disagree❞, which
I thought was laughable.</p>
Having said that, you can still do this using <div class=quote> to mark the start and end of a multi-paragraph quotation as you've suggested, coupled with the following CSS:
q {
quotes: '\201c' '\201d' '\2018' '\2019';
}
.quote > p:not(:last-of-type) > q:last-child {
quotes: '\201c' '' '\2018' '';
}
<div class=quote>
<p><q>Something something,</q> he said. <q>Lorem ipsum dolor sit amet,
ne natum elitr his, usu ex dictas everti, utamur reformidans ad vis.
Eam dissentiet deterruisset an, vis cu nullam lobortis. Doming
inimicus eu nec, laudem audire ceteros cu vis, et per eligendi
splendide. Ne legere tacimates instructior qui. Te vis dicat iudico
integre, ex est prima constituam consequuntur. Vix sanctus voluptaria
ei, usu ornatus iracundia ne, nam nulla iudico no. Duo ei labores
nusquam.</q></p>
<p><q>In harum docendi fuisset vis. Meis constituam ea quo. Ei vim prima
liber officiis. Ad modo tota augue est, fugit soleat blandit eos ex.</q></p>
</div>
But this requires using <div class=quote> wherever necessary in the first place. Granted, as it's just a div, it doesn't set the text containing the quotation apart from the rest of the prose (in contrast, a blockquote would be entirely inappropriate for this reason), or otherwise change the meaning of the text, but it's still not as clean as you might like. It does however work regardless of whether that second paragraph has been represented in its entirety or if there is more text following the closing quotation mark — or indeed, even if the second paragraph contains both the end of one quotation and the start of another (just move your </div> end tag to where the second quotation ends).
You'll notice in the above snippet that the <q> elements themselves are split by paragraph; this is perfectly normal since <q> is a phrasing element and therefore, as you've stated, cannot span multiple flow elements. But if you really are worried that the split <q> elements will be seen (particularly by AT) as two separate quotations altogether, you can either associate them using a class or a custom data attribute, or just go with quotation marks which are much simpler and will convey the meaning of the text just as effectively.

How to extract nested fields from MongoDB?

So my problem is that I have a database structure (designed by someone else and I have to work on it now) as follows:
DBS:
Database 1
Database 2
Database 3
Collection 1
Collection 2
field_1
field_1_1
field_1_1_1
field_1_2
field_2
field_3
Collection 3
Collection 4
Database 4
Now I want to extract the field field_1_1_1. Any idea how I can query that?
So far I have tried applying find_one on Database 3.Collection 2.field_1.field_1_1.field_1_1_1, but obviously it did not work.
So here goes the actual content as requested. This is what 1 item in the collection "tempStorage" under the database "workApp" looks like.
{"_id":{"tag":"i4x","org":"Temp","course":"CXV_08","category":"about","name":"overview","revision":null},
"definition":
{"data":
{"data":
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum"
}
}
}
Edit: Uploaded the actual fields.
What does work is exporting the entire collection and then parsing it, but the data is already 0.4 GB, and I doubt that this is the only option; there must be something better.
Anyone with good experience in MongoDB who can help me out?

Try this:
var data = db.tempStorage.find()
data[0].definition.data.data
Here, db.tempStorage.find() returns a cursor over all documents in the collection, stored in the variable data. You can index into it and then use dot notation to reach deep into a document, as I have done in data[0].definition.data.data.
If the collection contains only one document, then findOne() can also be used:
var data = db.tempStorage.findOne()
data.definition.data.data
HTH!
Thanks.
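The same dot-notation walk can be mimicked in Python once a document is available as a plain dict (for instance, one loaded from a driver or an export). The sketch below is only an illustration: get_path is a hypothetical helper, not a MongoDB or driver function, and DOC is a shortened copy of the document from the question.

```python
# Shortened copy of the document shown in the question (illustrative only).
DOC = {
    "_id": {"tag": "i4x", "org": "Temp", "course": "CXV_08",
            "category": "about", "name": "overview", "revision": None},
    "definition": {"data": {"data": "Lorem ipsum dolor sit amet"}},
}


def get_path(document, dotted_path, default=None):
    """Follow a dot-separated path such as 'definition.data.data'."""
    node = document
    for key in dotted_path.split("."):
        # Stop as soon as the path leaves the nested-dict structure.
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


print(get_path(DOC, "definition.data.data"))
print(get_path(DOC, "definition.missing.data", default="<not found>"))
```

Returning a default instead of raising keeps the helper usable on collections where some documents lack the nested field.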

WPF split HTML string for every TextBlock

Does anyone know how to split an HTML string across several TextBlocks, one per page? I split the text, but something is wrong: the line count is not the same on every page. How do I solve this issue?
Here's the code:
XAML:
<Window x:Class="Ebook.MainWindow"
xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
xmlns:controls="clr-namespace:WPFMitsuControls;assembly=WPFMitsuControls"
Title="eBook" Height="600" Width="800"
Loaded="MainWindow_OnLoaded" Background="Silver">
<DockPanel>
<Viewbox Margin="10">
<Grid>
<controls:Book x:Name="myBook" Width="600" Height="400" Margin="20">
<controls:Book.ItemTemplate>
<DataTemplate>
<Border BorderThickness="4" BorderBrush="Gray" Background="White">
<ContentControl Content="{Binding .}" />
</Border>
</DataTemplate>
</controls:Book.ItemTemplate>
</controls:Book>
<Button Content="&lt;" HorizontalAlignment="Left" VerticalAlignment="Center" VerticalContentAlignment="Center" Background="Transparent" Height="50" Click="AutoPreviousClick" />
<Button Content=">" HorizontalAlignment="Right" VerticalAlignment="Center" VerticalContentAlignment="Center" Background="Transparent" Height="50" Click="AutoNextClick" />
</Grid>
</Viewbox>
</DockPanel>
</Window>
CODE:
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows;
using System.Windows.Controls;
using System.Windows.Media;
using System.Globalization;
using System.IO;
namespace Ebook
{
public partial class MainWindow : Window
{
public MainWindow()
{
InitializeComponent();
}
string html_str = "<p>The standard <b>Lorem Ipsum</b> passage, used since the 1500s</p>"
+ "<p>'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'</p>"
+ "<p>Section 1.10.32 of 'de Finibus Bonorum et Malorum', written by Cicero in 45 BC</p>"
+ "<p>'Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?'</p>"
+ "<p><u>1914 translation by H. Rackham</u></p>"
+ "<p>'But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or pursues or desires to obtain pain of itself, because it is pain, but because occasionally circumstances occur in which toil and pain can procure him some great pleasure. To take a trivial example, which of us ever undertakes laborious physical exercise, except to obtain some advantage from it? But who has any right to find fault with a man who chooses to enjoy a pleasure that has no annoying consequences, or one who avoids a pain that produces no resultant pleasure?'</p>"
+ "<p>Section 1.10.33 of 'de Finibus Bonorum et Malorum', written by Cicero in 45 BC</p>"
+ "<p>'At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. Temporibus autem quibusdam et aut officiis debitis aut rerum necessitatibus saepe eveniet ut et voluptates repudiandae sint et molestiae non recusandae. Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat.'</p>"
+ "<p><i>1914 translation by H. Rackham</i></p>"
+ "<p>'On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains.'</p>";
private void AutoNextClick(object sender, RoutedEventArgs e)
{
myBook.AnimateToNextPage(false, 700);
myBook.Focus();
}
private void AutoPreviousClick(object sender, RoutedEventArgs e)
{
myBook.AnimateToPreviousPage(false, 700);
myBook.Focus();
}
private void MainWindow_OnLoaded(object sender, RoutedEventArgs e)
{
List<string> lines = WrapText(html_str, 200, "Arial", 12);
int rows = 20;
int pages = lines.Count / rows;
string _Page = "";
int j = 0, k = 0;
TextBlock tb1;
for (int i = 0; i < pages; i++)
{
tb1 = new TextBlock();
tb1.TextWrapping = TextWrapping.Wrap;
tb1.Margin = new Thickness(5);
tb1.TextAlignment = TextAlignment.Justify;
_Page = "";
for (j = 0; j < rows; j++)
{
_Page += lines[rows * i + j] + "\n";
k++;
}
//myBook.Items.Add(new TextBlock { Text = _Page, TextWrapping = TextWrapping.Wrap, Margin = new Thickness(5), TextAlignment = TextAlignment.Justify });
tb1.Inlines.AddRange(MarkupProcessor.HTMLToWPF(_Page));
myBook.Items.Add(tb1);
}
}
static List<string> WrapText(string text, double pixels, string fontFamily, float emSize)
{
string[] originalLines = text.Split(new string[] { " " },
StringSplitOptions.None);
List<string> wrappedLines = new List<string>();
StringBuilder actualLine = new StringBuilder();
double actualWidth = 0;
foreach (var item in originalLines)
{
FormattedText formatted = new FormattedText(item,
CultureInfo.CurrentCulture,
System.Windows.FlowDirection.LeftToRight,
new Typeface(fontFamily), emSize, Brushes.Black);
actualLine.Append(item + " ");
actualWidth += formatted.Width;
if (actualWidth > pixels)
{
wrappedLines.Add(actualLine.ToString());
actualLine = new StringBuilder();
actualWidth = 0;
}
}
if (actualLine.Length > 0)
wrappedLines.Add(actualLine.ToString());
return wrappedLines;
}
}
}
http://i.stack.imgur.com/AEpsm.png
Here's the download link to the project: http://www.megafileupload.com/en/file/532650/WPF-Ebook-zip.html

You can't control the number of lines because of this:
tb1.TextWrapping = TextWrapping.Wrap;
The line count depends on the length of each string: long strings are wrapped onto additional lines.
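Separately from the wrapping, the paging loop in the question computes pages = lines.Count / rows with integer division, so a trailing partial page is never added. Below is a language-neutral sketch of the chunking step, written in Python. The helper name paginate is hypothetical and not part of WPF, and the lines are assumed to be wrapped already.

```python
import math


def paginate(lines, rows_per_page):
    """Split pre-wrapped lines into pages, keeping the final partial page."""
    if rows_per_page <= 0:
        raise ValueError("rows_per_page must be positive")
    # Round up so leftover lines still get a page of their own.
    page_count = math.ceil(len(lines) / rows_per_page)
    return [
        lines[i * rows_per_page:(i + 1) * rows_per_page]
        for i in range(page_count)
    ]


wrapped = [f"line {n}" for n in range(45)]
pages = paginate(wrapped, 20)
print([len(page) for page in pages])
```

This only fixes the page count. As long as the TextBlock re-wraps each line itself, the visible number of lines per page can still differ.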
