Get cell paragraph without losing the current style - python-docx

I have this function running:
def get_bold_lines_from_cell(cellColumn, cellRow):
    for index, paragraph in enumerate(table.cell(cellRow, cellColumn).paragraphs):
        for run in paragraph.runs:
            if run.bold:
                pass  # do stuff
Even though the cell's paragraphs are full of bold text, it doesn't recognize any of it. Does the text lose its style because I've turned a docx into tables? Is there any way to get the paragraph style?
Thanks!

If anybody ever has the same problem, this is the solution that I came up with:
for table in tables:
    cell = table._cells[cellNumber]
    for paragraphIndex, paragraph in enumerate(cell.paragraphs):
        for parentParagraphsIndex, parentParagraphs in enumerate(paragraph._parent.paragraphs):
            for run in parentParagraphs.runs:
                tempString = parentParagraphs.text.encode('utf-8')
                if run.bold:
                    # bold is set directly on the run
                    # do stuff
                    break
                elif run.style.style_id == "Strong":
                    # bold comes from the "Strong" character style
                    # do stuff
                    break
                else:
                    # do stuff
                    break

Best bet is to look at the XML for each object to find clues.
print(paragraph._element.xml)
print(run._element.xml)
If there's a style applied you'll see it in the w:pPr or w:rPr element.
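As a rough illustration of what to look for in that XML, here is a minimal sketch using only the standard library. The two run snippets are hand-written samples, not output from a real document, and `describe_run` is a hypothetical helper name. It distinguishes bold set directly on the run (a `w:b` element inside `w:rPr`) from bold that comes through a character style (a `w:rStyle` reference). As far as I know, `run.bold` in python-docx is `None` in the second case because the value is inherited, which would explain the behaviour in the question.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def describe_run(run_xml):
    """Report direct bold formatting and any character style on a w:r element."""
    run = ET.fromstring(run_xml)
    bold = run.find("w:rPr/w:b", NS)
    style = run.find("w:rPr/w:rStyle", NS)
    val = f"{{{W}}}val"
    return {
        # <w:b/> means bold unless w:val explicitly switches it off.
        "direct_bold": bold is not None and bold.get(val) not in ("0", "false", "off"),
        "style_id": style.get(val) if style is not None else None,
    }

# Hand-written sample runs (assumptions for illustration only).
direct = f'<w:r xmlns:w="{W}"><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r>'
styled = f'<w:r xmlns:w="{W}"><w:rPr><w:rStyle w:val="Strong"/></w:rPr><w:t>Bold</w:t></w:r>'

print(describe_run(direct))
print(describe_run(styled))
```

In a real script the input would be the string from `run._element.xml` rather than these samples.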

set auto indent on newline between auto closed HTML tag in vim (<div>|</div>) [duplicate]

In most IDEs and modern text editors (Sublime Text 3), the cursor is correctly indented after inserting a newline inside an HTML tag pair (aka "expanding" the tag):
Before:
<div>|</div>
After pressing CR:
<div>
|
</div>
But in Vim, this is what I get:
<div>
|</div>
How can I get the same behaviour in Vim like in most other editors (see above)?
The only correct behavior of <CR> in insert mode is to break the line at the cursor.
What you want is an enhanced behavior and you need to add something to your config to get it: a mapping, a short function or a full fledged plugin.
When I started to use vim, that behavior was actually one of the first things I added to my vimrc. I've changed it many times in the past but this mapping has been quite stable for a while:
inoremap <leader><CR> <CR><C-o>==<C-o>O
I've used <leader><CR> to keep the normal behavior of <CR>.
Here is a small function that seems to do what you want:
function! Expander()
    let line = getline(".")
    let col = col(".")
    let first = line[col-2]
    let second = line[col-1]
    let third = line[col]
    if first ==# ">"
        if second ==# "<" && third ==# "/"
            return "\<CR>\<C-o>==\<C-o>O"
        else
            return "\<CR>"
        endif
    else
        return "\<CR>"
    endif
endfunction
inoremap <expr> <CR> Expander()
This little snippet will remap Enter in insert mode to test whether or not the cursor is between > and < and act accordingly if it is. Depending on your indent settings the \<Tab> may need to be removed.
It will not play nicely with other plugins that also map the Enter key, so be aware that there is probably more work to do if you want that compatibility.
function EnterOrIndentTag()
    let line = getline(".")
    let col = getpos(".")[2]
    let before = line[col-2]
    let after = line[col-1]
    if before == ">" && after == "<"
        return "\<Enter>\<C-o>O\<Tab>"
    endif
    return "\<Enter>"
endfunction
inoremap <expr> <Enter> EnterOrIndentTag()
I have only tested the simple cases (beginning of the line, end of the line, inside and outside of ><), there are probably edge cases that this would not catch.
@RandyMorris and @romainl have posted good solutions for your exact problem.
There are some other possibilities you might be interested in if you are typing out these tags yourself: there's the ragtag.vim plugin for HTML/XML editing.
With ragtag.vim you type this to create your "before" situation (in insert mode):
div<C-X><Space>
To create your "after" situation you would instead type:
div<C-X><Enter>
So if you know beforehand that you are going to "expand" the tag, typing just the element name followed by Ctrl-X and then Enter is enough.
There are also other more advanced plugins to save keystrokes when editing HTML, such as ZenCoding.vim and Sparkup.
Since no one has mentioned it, I will. There is an excellent plugin that does exactly that:
delimitMate

Mixed results with perl regex, matching list of phrases in html code

This new post was in response to another post, Perl Regex match lines that contain multiple words, but was, for reasons unknown to me, deleted by the moderator. It seemed logical to me to ask the question in the original thread because it has to do with an attempt to use the solution given early on in that thread, and a problem with it. There was a generic reference to the faq, which didn't seem to reveal any discrepancies, and the message, "If you have a question, please post your own question." Hence this post.
I am using LWP::Simple to get a web page and then trying to match lines that contain certain phrases. I copied the regex in answer #1 in the above-mentioned thread, and replaced/added words that I need to match, but I am getting mixed results with two similar but different web pages.
The regex I am using is:
/^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim
For web site #1, which has bare lines containing these words, in a series of blocks surrounded by <pre>..</pre> tags, it matches all lines exactly equal to this one, as expected:
Year New Moon First Quarter Full Moon Last Quarter
BUT for web site #2, which has nasty little tags surrounding the words:
<br><br><span class="prehead"> Year New Moon First Quarter Full Moon Last Quarter ΔT</span><br>
it matches EVERY line!
I'm sure the <span> tags are the "proper" way to do this but I am wondering how to get around those tags so I can have just one regex for both sites. Is there a simple way to do this or do I have to learn how to parse html (something I'd rather not have to do)?
I'm looking for a quick solution, not a robust one. This is probably a one-time-only deal. If these relatively static pages change, it will probably be minor and easy to fix. Please don't refer me to all the 'anti-regex-for-html' pages. I've seen 'em. And please don't make me use HTML::TreeBuilder. Oh please...
If I am correct in my assumption, you would like to match only the specific sequence of words:
Year New Moon First Quarter Full Moon Last Quarter
with free spacing regardless of the tags at the ends.
We can use this to match any properly formatted opening and closing tags at either end
<[^>]*?>
This means any string between an opening "<" and the first closing ">".
Next we want to allow for spaces around those tags, so we use "\s*" (zero or more whitespace characters) on either side:
\s*<[^>]*?>\s*
Next we want to group that in a non-capturing (for efficiency) group and let it repeat zero or more times. This is what we will put at either end of the regex to make sure the tags are matched:
(?:\s*<[^>]*?>\s*)*
Then we will fill in the desired text using the "\s*" between phrases to make sure space and only space is allowed between them:
(?:\s*<[^>]*?>\s*)*\s*Year\s*New Moon\s*First Quarter\s*Full Moon\s*Last Quarter\s*(?:\s*<[^>]*?>\s*)*
Then finish off with the line beginning and end line markers
/^(?:\s*<[^>]*?>\s*)*\s*Year\s*New Moon\s*First Quarter\s*Full Moon\s*Last Quarter\s*(?:\s*<[^>]*?>\s*)*$/gim
This should match any lines containing an arbitrary number of tags at either end of the desired phrases, but not match if anything else comes in such as additional characters. It should also be pretty efficient because it doesn't use any look-arounds. Let me know if I misunderstood the question though.
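To make the behaviour of this pattern concrete, here is a small sketch in Python, whose `re` module accepts the same syntax for this pattern. The three sample strings are assumptions modelled on the lines quoted in the question; `TAGS` and `HEADER` are names made up for the example. Note the third sample: the line quoted for web site #2 ends with "ΔT" before the closing tag, and since this pattern allows nothing but tags and whitespace around the phrases, such a line is rejected. That may be worth checking if the pattern outputs nothing for that page.

```python
import re

# Zero or more tags, each optionally surrounded by whitespace.
TAGS = r"(?:\s*<[^>]*?>\s*)*"
HEADER = re.compile(
    r"^" + TAGS
    + r"\s*Year\s*New Moon\s*First Quarter\s*Full Moon\s*Last Quarter\s*"
    + TAGS + r"$",
    re.IGNORECASE,
)

# Sample lines modelled on the question (assumptions, not fetched from the sites).
bare = " Year New Moon First Quarter Full Moon Last Quarter"
tagged = ('<br><br><span class="prehead"> Year New Moon First Quarter '
          'Full Moon Last Quarter</span><br>')
tagged_with_extra = ('<br><br><span class="prehead"> Year New Moon First Quarter '
                     'Full Moon Last Quarter ΔT</span><br>')

for sample in (bare, tagged, tagged_with_extra):
    print(bool(HEADER.match(sample)), sample)
```

If the extra column should be tolerated, one option would be to allow arbitrary non-tag text before the trailing tags, at the cost of a looser match.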
I finally got this working for both urls using the original regex by looping through the retrieved html document directly:
for my $line (split qr/\R/, $doc)
{
    next unless $line =~ /^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim; # original
    print "$line\n";
}
It really shouldn't be this difficult. ;-)
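One plausible reading of why splitting on \R helps (this is an assumption, the thread does not confirm it) is that the second page uses line endings other than a bare "\n", so reading it "line by line" returns the whole document as one line, and the lookaheads then succeed on that single huge line. The sketch below reproduces that effect in Python with a made-up document that uses carriage returns only; `str.splitlines()` plays the role of `split qr/\R/`.

```python
import re

PHRASES = re.compile(
    r"^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)"
    r"(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$",
    re.IGNORECASE,
)

# Hypothetical document whose lines end in "\r" rather than "\n".
doc = "<html>\r Year New Moon First Quarter Full Moon Last Quarter\r</html>\r"

# Splitting on "\n" only: the whole document comes back as a single "line".
naive_lines = doc.split("\n")
# splitlines() recognises "\r", "\n" and "\r\n", much like \R in Perl.
universal_lines = doc.splitlines()

naive_hits = [line for line in naive_lines if PHRASES.search(line)]
universal_hits = [line for line in universal_lines if PHRASES.search(line)]

print(len(naive_lines), len(naive_hits))
print(len(universal_lines), universal_hits)
```

With the naive split, the one "line" that matches is the entire document, which looks like every line being printed.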
@Jake:
Hey thanks a lot for this. You are the person I am looking for. I tried it and it works with the first url but outputs nothing for the second one.
Using my original regex, I also tried stripping the html tags with HTML::TreeBuilder:
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($doc);
my $non_html = $tree->as_text();
open FILE, "<", \$non_html or die "can't open $non_html: $!\n";
with no results for either url.
I tried HTML::Strip:
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse($doc);
$hs->eof;
open FILE, "<", \$clean_text or die "can't open $clean_text: $!\n";
with same results as original--first url works as expected, second one outputs all (stripped) lines. Maybe there is a problem with my code here. I don't know.
Here is the essence of my script (this runs):
use strict;
use warnings;
use LWP::Simple;
my $url = 'http://eclipse.gsfc.nasa.gov/phase/phases2001.html';
#my $url = 'http://www.astropixels.com/ephemeris/moon/phases2001gmt.html';
my $doc = get $url;
die "Couldn't get $url" unless defined $doc;
open FILE, "<", \$doc or die "can't open $doc: $!\n";
while (my $line = <FILE>)
{
    #next unless $line =~ /^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim; # original
    next unless $line =~ /^(?:\s*<[^>]*?>\s*)*\s*Year\s*New Moon\s*First Quarter\s*Full Moon\s*Last Quarter\s*(?:\s*<[^>]*?>\s*)*$/gim; # Jake's
    print "$line";
}

How do I test for the percentage of bold text on a webpage?

I have an instance where I need to test how page content is styled (not necessarily only with CSS).
For example, a test (cucumber) I would like to write is:
In order to standardize text weight
As a webmaster
I want to be told the percentage of bold text on the page
The problem is, I'm having a hard time figuring out how to actually generate this result. Looking at various HTML testing frameworks (Selenium, Watir, Capybara), it seems like I can only test for the presence of tags or the presence of css classes, and not the calculated visual result.
In Firebug, I can see the calculated CSS result (which works for <strong>, <b>, and font-weight:bold definitions), but I need to be able to put this into a testing framework to run under CI.
In Watir, you can get an element's font-weight by directly accessing the win32ole object. For example:
ie.div(:index, 1).document.currentStyle.fontWeight
This will give you a number representing the weight as described in http://www.w3schools.com/cssref/pr_font_weight.asp
What I think you would then need to do is iterate through all elements on the page, checking each element's fontWeight and how much text it contains. The way you do that will depend on the page you are testing.
Solution 1 - If all text is in divs that are leaf nodes:
If all your text is in leaf nodes like this:
<body>
<div style='font-weight:bold'>Bold</div>
<div>Plain</div>
</body>
You could easily do:
bold_text = 0
plain_text = 0
ie.divs.each{ |x|
  if x.document.currentStyle.fontWeight >= 700
    bold_text += x.text.length
  else
    plain_text += x.text.length
  end
}
Solution 2 - If styles interact or using multiple elements:
If not all of the text is in leaf nodes or you use other tags like <b> (see example HTML below), you would need a more complicated check. This is because .text returns all text in the element, including the text of its child elements.
<body>
  <div style='font-weight:normal'>
    Start
    <div style='font-weight:bold'>Bold1</div>
    <div style='font-weight:bold'>Bold2</div>
    End
  </div>
  <b>Bold Text</b>
</body>
In this case, I believe the following works for most cases (but may need refinement):
#Counting letters, but you could easily change to words
bold_count = 0
plain_count = 0
#Check all elements, though you can change this to restrict to a particular containing element if desired.
node_list = ie.document.getElementsByTagName("*")
0.upto(node_list.length-1) do |i|
  #Name the node so it is easier to work with.
  node = node_list["#{i}"]
  #Determine if the text for the current node is bold or not.
  #Note that this works in IE. You might need to modify for other browsers.
  if node.currentStyle.fontWeight >= 700
    bold = true
  else
    bold = false
  end
  #Go through the childNodes. If the node is text, count it. Otherwise ignore.
  node.childNodes.each do |child|
    unless child.nodeValue.nil?
      if bold
        bold_count += child.nodeValue.length
      else
        plain_count += child.nodeValue.length
      end
    end
  end
end
#Determine number of characters that are bold and not. These can be used to determine your percentage.
puts bold_count
puts plain_count
It is not a very Watir-like solution, but hopefully solves your problem.
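For comparison, the same counting idea (attribute each text node to its nearest enclosing element and add its length to a bold or plain total) can be sketched without a browser. This is a simplified Python analogue, not the Watir approach above. It only looks at `<b>`, `<strong>` and inline `font-weight` declarations, so unlike `currentStyle` it knows nothing about stylesheet rules. `BoldCounter` and `VOID_TAGS` are names invented for the example, and whitespace is not counted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class BoldCounter(HTMLParser):
    """Counts non-whitespace characters in bold and non-bold text nodes."""

    def __init__(self):
        super().__init__()
        self.bold_stack = [False]  # bold state of each open element
        self.bold_count = 0
        self.plain_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        bold = self.bold_stack[-1]  # inherit from the parent element
        if tag in ("b", "strong"):
            bold = True
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if "font-weight:bold" in style or "font-weight:700" in style:
            bold = True
        elif "font-weight:normal" in style or "font-weight:400" in style:
            bold = False
        self.bold_stack.append(bold)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.bold_stack) > 1:
            self.bold_stack.pop()

    def handle_data(self, data):
        length = len("".join(data.split()))  # ignore whitespace
        if self.bold_stack[-1]:
            self.bold_count += length
        else:
            self.plain_count += length

html = """<body>
<div style='font-weight:normal'>
Start
<div style='font-weight:bold'>Bold1</div>
<div style='font-weight:bold'>Bold2</div>
End
</div>
<b>Bold Text</b>
</body>"""

counter = BoldCounter()
counter.feed(html)
counter.close()
total = counter.bold_count + counter.plain_count
print(counter.bold_count, counter.plain_count, round(100 * counter.bold_count / total, 1))
```

On the sample HTML from Solution 2 this counts 18 bold and 8 plain characters. For pages styled through external CSS, a browser-driven check like the one above is still needed.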

Find and replace entire HTML nodes with Nokogiri

I have an HTML document that needs to be transformed by replacing some tags with other tags.
I don't know these tags in advance, because they will come from the db. So the set_attribute or name methods of Nokogiri are not suitable for me.
I need to do it in a way like this pseudo-code:
def preprocess_content
  doc = Nokogiri::HTML( self.content )
  doc.css("div.to-replace").each do |div|
    # "get_html_text" will obtain HTML from db. It can be anything, even other tags, tag groups etc.
    div.replace self.get_html_text
  end
  self.content = doc.css("body").first.inner_html
end
I found the Nokogiri::XML::Node#replace method. I think it is the right direction.
This method expects a node_or_tags parameter.
Which method should I use to create a new node from text and replace the current one with it?
Like this:
doc.css("div.to-replace").each do |div|
  new_node = doc.create_element "span"
  new_node.inner_html = self.get_html_text
  div.replace new_node
end

Get the type of an element in Hpricot

I want to go through the children of an element and keep only the ones that are text or span, something like:
element.children.select {|child|
  child.class == String || child.element_type == 'span'
}
but I can't find a way to test which type a certain element is. How do I test that? I'd like to know regardless of whether there's a better way of doing what I'm trying to do, but I'd also appreciate suggestions on that.
Found it:
element.name
#=> "span"