Find the paragraph number glyph - google-apps-script

I have a document with multilevel paragraph numbering. As I traverse the paragraphs in GAS, how do I get the actual number shown on each paragraph?
E.g. 1, 1.2, 1.2.3 etc.
I tried ListItem, but the list ID it returns is just a string identifier.

If you are only referring to NUMBER as the glyph type, then yes, you can achieve it with the ListItem methods. The problem is the "dot" that is present in your numbers. I think this is because the only glyph types that are supported are BULLET, HOLLOW_BULLET, SQUARE_BULLET, NUMBER, LATIN_UPPER, LATIN_LOWER, ROMAN_UPPER, and ROMAN_LOWER. As the "dot" is not considered part of the number, I don't think this can be done with simple code. I have found this GitHub post; you could check whether it is of any help.
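One way to rebuild labels such as 1, 1.2, 1.2.3 yourself is to keep one counter per nesting level while walking the list items. The sketch below is in Python rather than Apps Script, and it assumes the nesting level of each list item has already been collected (for example via ListItem.getNestingLevel()). The function name number_labels and the input format are hypothetical, and it assumes every level uses plain decimal numbering starting at 1.

```python
def number_labels(levels):
    """Return dotted labels for a sequence of 0-based nesting levels."""
    counters = []  # counters[i] is the current count at nesting level i
    labels = []
    for level in levels:
        # Dropping deeper counters restarts their numbering the next time they appear.
        del counters[level + 1:]
        # Pad with zeros if the list jumps more than one level deeper at once.
        while len(counters) <= level:
            counters.append(0)
        counters[level] += 1
        labels.append(".".join(str(c) for c in counters))
    return labels


print(number_labels([0, 1, 1, 2, 0, 1]))
```

The same bookkeeping would have to be kept separately for each list ID if the document contains more than one list.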

Related

How to select all elements with a specific name under every li node with the same structure?

I have a certain bunch of XPath locators that hold the elements I want to extract, and they have a similar structure:
/div/ul/li[1]/div/div[2]/a
/div/ul/li[2]/div/div[2]/a
/div/ul/li[3]/div/div[2]/a
...
They are simplified from a Pixiv user page. Each /div/div[2]/a element has a title string, so they are actually artwork titles.
I want to use a single expression to fetch all of the above a elements in a WebExtension called PageProbe. I've tried a bunch of methods, but none of them returns the wanted result.
However, the following expression does return all the a elements, including the ones I don't need.
/div/
The following expression returns the a element under only the first li item.
/div/ul/li/div/div[2]/a
Sorry for not providing enough info earlier. Hope someone can help me out. Thanks.
According to the information you gave here you can simply use this xpath:
/div/ul/li/div/div[2]/a
However, I'm quite sure there is a better locator based on other attributes such as class names.
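To see the path select one a element per li, here is a minimal Python sketch using the standard library's ElementTree, which supports a subset of XPath. The markup is invented to mirror the structure in the question, and the path is written relative to the outer div because ElementTree searches from the element you call it on.

```python
import xml.etree.ElementTree as ET

# Made-up markup with the same shape as the locators in the question.
html = """
<div>
  <ul>
    <li><div><div>thumb</div><div><a>First title</a></div></div></li>
    <li><div><div>thumb</div><div><a>Second title</a></div></div></li>
    <li><div><div>thumb</div><div><a>Third title</a></div></div></li>
  </ul>
</div>
"""

root = ET.fromstring(html)  # root is the outer <div>
# No index on li, so every li contributes its div/div[2]/a element.
titles = [a.text for a in root.findall("ul/li/div/div[2]/a")]
print(titles)
```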

Regex Between HTML Tags - VBA

I have a page full of html data that I am scraping from.
There is one occurrence of a "gross amount" field that I am trying to extract.
<h3 id="cart_trans_detail_ach_grossamount_lbl">Gross Amount</h3>
<p id="cart_trans_detail_ach_grossamount_txt">$76.99 USD</p>
All I want to get from this is $76.99 USD
I have tried using Regex Buddy to put a pattern together, but regex is not my strong suit. Even something simple like this: <p id="cart_trans_detail_ach_grossamount_txt">(.*)</p> matches the whole string and not just what is between the tags.
Any ideas?
First of all, using a regex to parse HTML is not recommended; you should use an HTML/XML parsing library instead. But if you really feel the need to use a regular expression for that, what you are missing is the lazy (non-greedy) modifier ? after the * quantifier, so that the match stops at the first </p> it finds.
<p id="cart_trans_detail_ach_grossamount_txt">(.*?)</p>
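As a quick check of the lazy pattern, here is a small sketch in Python rather than VBA. The third line of sample HTML is an invented extra paragraph, added only so there is more than one </p> in the input.

```python
import re

html = (
    '<h3 id="cart_trans_detail_ach_grossamount_lbl">Gross Amount</h3>\n'
    '<p id="cart_trans_detail_ach_grossamount_txt">$76.99 USD</p>\n'
    '<p id="other">unrelated</p>'  # invented extra paragraph
)

# (.*?) is lazy: it stops at the first </p> after the opening tag.
pattern = re.compile(r'<p id="cart_trans_detail_ach_grossamount_txt">(.*?)</p>')
match = pattern.search(html)
gross_amount = match.group(1) if match else None
print(gross_amount)
```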
Try this pattern:
(?<=grossamount_txt">\$)(\d*\.?\d*) USD
It works in Python and PHP, and it should also work in Java.
group(1) gives you back only the amount, without anything else.
The first parenthesis encloses a positive lookbehind, which checks that the amount is preceded by the string grossamount_txt">$.
The second parenthesis then tries to match a numeric amount, possibly with both an integer part and a decimal part.
Finally, the last part of the pattern is " USD".
You can test it here
https://www.regex101.com/#python
where you can also find some more detailed explanation.
Here about how lookaround works
http://www.regular-expressions.info/lookaround.html
Hope it helps.
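A minimal Python sketch of the lookbehind pattern, using the snippet from the question as input. Lookbehind is not available in every regex engine, so treat this as an illustration of the pattern rather than something guaranteed to carry over to other environments.

```python
import re

html = '<p id="cart_trans_detail_ach_grossamount_txt">$76.99 USD</p>'

# The lookbehind asserts the prefix without consuming it; group 1 is the number.
pattern = re.compile(r'(?<=grossamount_txt">\$)(\d*\.?\d*) USD')
match = pattern.search(html)
amount = match.group(1) if match else None
print(amount)
```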

Finding a string that is split by multiple html tags

I am using XPath to find a list of strings in an HTML document. The strings appear when you type into a text box, to suggest possible results; in other words, it's auto-complete. The problem is that I'm trying to retrieve the whole list of auto-complete suggestions, but the results are all split up by <strong> tags.
To give a couple examples: I start typing "str" and the HTML will look like this:
<strong>str</strong>ing
But it gets better! If I don't type anything at all, every single character in the auto-complete results will be interrupted with opening and closing strong tags. Like so:
s
<strong></strong>
t
<strong></strong>
r
<strong></strong>
i
<strong></strong>
n
<strong></strong>
g
So, my question is, how do I construct an xpath that retrieves this string, but omits the strong tags?
For reference, the hierarchy of the HTML looks like this:
-div
--ul
---li
----(string I'm looking for)
---li
----(another string I'm looking for)
So my XPath at this point is: //div[@class='class']/ul/li/text(), which will get me the individual parts of the strings.
This XPath expression:
string(PathToYourDiv/ul/li[$n])
evaluates to the string value of the $n-th li child of the ul that is a child of YourDiv. This is the concatenation of all the text-node descendants of this li element, which effectively gives you the complete string you want.
You have just to substitute YourDiv and $n with specific expressions.
Do not use the // abbreviation, because:
Its evaluation can be very slow.
Indexing such an expression with [] is not intuitive and produces surprising results, which has made it a frequently asked question.
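The string-value idea can be sketched in Python with the standard library's ElementTree, where joining itertext() concatenates all text-node descendants of an element, much like string() does in XPath. The fragment below is invented from the two examples in the question.

```python
import xml.etree.ElementTree as ET

# Invented fragment: one partially typed suggestion, one with empty <strong> tags.
fragment = (
    "<ul>"
    "<li><strong>str</strong>ing</li>"
    "<li>s<strong></strong>t<strong></strong>r</li>"
    "</ul>"
)

ul = ET.fromstring(fragment)
# itertext() yields every text node under the li, in document order.
suggestions = ["".join(li.itertext()) for li in ul.findall("li")]
print(suggestions)
```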
That is much less code on the question than people would like to see around here.
But why don't you try a variant like this:
//div[@class='class']/ul/li/strong/text()

Formatting a String Array to Display to Users

What is the best format to communicate an array of strings in one string to users who are not geeks?
I could do it like this:
Item1, Item2, Item3
But that becomes meaningless when the strings contain spaces and commas.
I could also do it this way:
"Item1", "Item2", "Item3"
However, I would like to avoid escaping the array elements because escaped characters can be confusing to the uninitiated.
Edit: I should have clarified that I need the formatted string to be one-line. Basically, I have a list of lists displayed in a .Net Winforms ListView (although this question is language-agnostic). I need to show the users a one-line "snapshot" of the list next to the list's name in the ListView, so they get a general idea of what the list contains.
You can pick a character like the pipe (|), which is not used much outside programs. It is also used in wiki markup for tables, so it may be intuitive to those who are familiar with wiki markup.
Item1| Item2| Item3
In a GUI or color TUI, shade each element individually. In a monochrome TUI, add a couple of spaces and advance to the next tab position (\t) between each word.
Using JSON, the above list would look like:
'["Item1", "Item2", "Item3"]'.
This is unambiguous and a syntax in widespread use. Just explain the nested syntax a little bit and they'll probably get it.
Of course, if this is to be displayed in a UI, then you don't necessarily want unambiguous syntax as much as you want it to actually look like something intended for the end user. In that case it would depend exactly how you are displaying this to the user.
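For reference, producing the one-line JSON form takes very little code. This is a Python sketch with made-up item values; note that embedded double quotes come out backslash-escaped, which is the kind of escaping the question wanted to avoid.

```python
import json

# Made-up items, including a comma and embedded quotes.
items = ["Item1", "Item, with comma", 'Item "quoted"']

one_line = json.dumps(items)  # a single-line JSON array
print(one_line)
```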
Display each element as a cell in a table.
How about line breaks after each string? :>
Display each string on a separate line, with line numbers:
1. Make a list
2. Check it twice
3. Say something nice
It's the way people write lists in the real world, y'know :)
Use some kind of typographical convention, for example a bold hashmark and space between strings.
milk # eggs # bread # apples # lettuce # carrots
CSV. Because the very first thing your non-technical user is going to do with delimited data is import it into a spreadsheet.
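A minimal Python sketch of the CSV option, using the standard csv module so that only items containing commas or quotes get quoted. The helper name to_csv_line and the sample items are hypothetical.

```python
import csv
import io


def to_csv_line(items):
    """Format a list of strings as a single CSV record without a trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(items)
    return buffer.getvalue().rstrip("\r\n")


print(to_csv_line(["milk", "eggs, large", "bread"]))
```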

Variable order regex syntax

Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:
<a href="home.php" class="link" title="Home">Home</a>
<a href="home.php" title="Home" class="link">Home</a>
How would I write a match that checks the class and title and works for both cases? I'm mainly looking for the syntax that allows me to check in any order, not just matching the class and title as I can do that. Is there any way besides just including both combinations and connecting them with a '|'?
Edit: My preference would be to do it in a single regex as I'm building it programatically and also unit testing it.
No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you a large number of different REs to check.
On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old fashioned approach of using an XML processing library?
If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
Have you considered xpath? (where attribute order doesn't matter)
//a[@class and @title]
Will select both <a> nodes as valid matches. The only caveat being that the input must be xhtml (well formed xml).
You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be
<a\b[^<>]*>
If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookahead for each of the attributes you want to match:
(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")
The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensures that we're matching a whole attribute name. Combining them we get:
<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>
Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
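The combined lookahead regex can be tried out with a short Python sketch. The sample tags are assumptions modelled on the question (an a element with href, class and title), plus one tag without a title that should be rejected.

```python
import re

LINK = re.compile(
    r'<a\b'
    r'(?=[^<>]*\s+class="link")'  # class="link" somewhere inside the tag
    r'(?=[^<>]*\s+title="Home")'  # title="Home" somewhere inside the tag
    r'[^<>]+>[^<>]+</a>'
)

samples = [
    '<a href="home.php" class="link" title="Home">Home</a>',
    '<a href="home.php" title="Home" class="link">Home</a>',
    '<a href="home.php" class="link">Home</a>',  # no title attribute
]

results = [bool(LINK.search(s)) for s in samples]
print(results)
```

Each additional attribute costs one more lookahead, so the pattern grows linearly instead of requiring one alternative per ordering.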
You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups doing whatever tests that you need.
Something like this (untested, using .net regex syntax with the \w for word characters and \s for whitespace):
<a ((?<key>\w+)\s?=\s?['"](?<value>\w+)['"])+ />
The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated, and probably a lot more error prone.
With a single regex you would need something like
<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>
Which is just a first hand guess without checking to see if it's even valid. Much easier to just divide and conquer the problem.
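The divide-and-conquer approach might look like the following Python sketch: one regex isolates the opening tag, and a second one collects its attributes into a dictionary, so the order no longer matters. The helper name link_attributes is hypothetical, and the sketch only handles double-quoted attribute values.

```python
import re

TAG = re.compile(r'<a\b[^>]*>')          # step 1: the opening <a ...> tag
ATTR = re.compile(r'([\w-]+)="([^"]*)"')  # step 2: name="value" pairs inside it


def link_attributes(html):
    """Return the attributes of the first <a> tag as a dict, or None if there is none."""
    tag = TAG.search(html)
    if tag is None:
        return None
    return dict(ATTR.findall(tag.group(0)))


attrs = link_attributes('<a href="home.php" title="Home" class="link">Home</a>')
print(attrs["class"] == "link" and attrs["title"] == "Home")
```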
A first ad hoc solution might be to do the following.
((class|title)="[^"]*?" *)+
This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solvable with assertions. But if you just want to extract the attributes, this might already be sufficient.
If you want to match a permutation of a set of elements, you could use a combination of back references and zero-width negative forward matching (negative lookahead).
Say you want to match any one of these six lines:
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB
You can do this with the following regex:
/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/
The back references (\1, \2) let you refer to your previous matches, and the zero-width negative forward matching ((?!...)) lets you negate a positional match, saying don't match if the contained pattern matches at this position. Combining the two makes sure that your match is a legitimate permutation of the given elements, with each possibility occurring only once.
So, for example, in ruby:
input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES
# outputs only the permutations
puts input.lines.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)
For a permutation of five elements, it would be:
/1-(abc|def|ghi|jkl|mno)-
2-(?!\1)(abc|def|ghi|jkl|mno)-
3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
For your example, the regex would be
/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/
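The same back-reference trick works outside Ruby. Here is a Python sketch of the regex for the question's example; the sample tags are assumptions, including one with a repeated class attribute that the negative lookahead should reject.

```python
import re

PERMUTED = re.compile(
    r'<a href="home\.php" (class="link"|title="Home") '
    r'(?!\1)(class="link"|title="Home")>Home</a>'  # second attribute must differ from the first
)

samples = [
    '<a href="home.php" class="link" title="Home">Home</a>',
    '<a href="home.php" title="Home" class="link">Home</a>',
    '<a href="home.php" class="link" class="link">Home</a>',  # repeated attribute
]

results = [bool(PERMUTED.search(s)) for s in samples]
print(results)
```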