Formatting a String Array to Display to Users - language-agnostic

What is the best format to communicate an array of strings in one string to users who are not geeks?
I could do it like this:
Item1, Item2, Item3
But that becomes meaningless when the strings contain spaces and commas.
I could also do it this way:
"Item1", "Item2", "Item3"
However, I would like to avoid escaping the array elements because escaped characters can be confusing to the uninitiated.
Edit: I should have clarified that I need the formatted string to be one-line. Basically, I have a list of lists displayed in a .Net Winforms ListView (although this question is language-agnostic). I need to show the users a one-line "snapshot" of the list next to the list's name in the ListView, so they get a general idea of what the list contains.

You can pick a character like the pipe (|), which is not used much outside programs. It is also used in wiki markup for tables, so it may be intuitive to those who are familiar with wiki markup.
Item1| Item2| Item3

In a GUI or color TUI, shade each element individually. In a monochrome TUI, add a couple of spaces and advance to the next tab position (\t) between each word.

Using JSON, the above list would look like:
'["Item1", "Item2", "Item3"]'.
This is unambiguous and a syntax in widespread use. Just explain the nested syntax a little bit and they'll probably get it.
Of course, if this is to be displayed in a UI, then you don't necessarily want unambiguous syntax as much as you want it to actually look like something intended for the end user. In that case it would depend on exactly how you are displaying this to the user.
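For the JSON suggestion, here is a minimal sketch in Python (the thread itself is language-agnostic; the function name to_json_snapshot and the sample items are made up for illustration). It shows that commas and quotes inside items stay unambiguous on one line, at the cost of the escaping the asker wanted to avoid:

```python
import json


def to_json_snapshot(items):
    """Return the list as a single-line JSON array string."""
    # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes.
    return json.dumps(items, ensure_ascii=False)


items = ["Item1", "Item 2, with a comma", 'Item "3"']
snapshot = to_json_snapshot(items)
print(snapshot)
```

The string round-trips through json.loads, so nothing is lost, but the backslash-escaped quotes are exactly what a non-technical reader may find confusing.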

Display each element as a cell in a table.

How about line breaks after each string? :>

Display each string on a separate line, with line numbers:
1. Make a list
2. Check it twice
3. Say something nice
It's the way people write lists in the real world, y'know :)

Use some kind of typographical convention, for example a bold hashmark and space between strings.
milk # eggs # bread # apples # lettuce # carrots

CSV. Because the very first thing your non-technical user is going to do with delimited data is import it into a spreadsheet.
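A rough sketch of the CSV idea, assuming Python's standard csv module (to_csv_line is a hypothetical helper name). With the default minimal quoting, only items that contain a comma or a quote get wrapped in quotes, so simple lists still read naturally:

```python
import csv
import io


def to_csv_line(items):
    """Format the items as one CSV record without the trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(items)
    # writerow appends a line terminator; drop it for a one-line snapshot.
    return buffer.getvalue().rstrip("\r\n")


line = to_csv_line(["Item1", "Item 2, with a comma", "Item3"])
print(line)  # Item1,"Item 2, with a comma",Item3
```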

Related

Find the paragraph number glyph

I have a document with multilevel numbering of paragraphs. As I traverse the paragraphs in GAS, how do I get the actual numbering on each paragraph?
Eg: 1,1.2,1.2.3 etc.
I tried ListItem but the ListId returned a string identifier.
If you are only referring to NUMBER as the glyph type, then yes, you can achieve it by using the ListItem methods. The problem is the "dot" that is present in your numbers. I think this is because the only supported glyph types are BULLET, HOLLOW_BULLET, SQUARE_BULLET, NUMBER, LATIN_UPPER, LATIN_LOWER, ROMAN_UPPER, and ROMAN_LOWER. As the "dot" is not considered part of a number, I don't think this can be done with simple code. I have found this GitHub post; you could check whether it is of some help.

What is regex expression for a string after I use nokogiri to scrape

I have this string and it is in an html document of 100 other names that are formatted the same:
<li>Physical education sed<span class="meta"><ul><li>15184745922</li></ul></span>
</li>
And I want to save 'Physical education sed' under a name column and '15184745922' under a number column.
I was wondering how do you do this in Ruby.
In nokogiri I can get only the li's by doing this:
puts page.css("ul li").text
but then it comes out all in one word:"Physical education sed15184745922"
I was thinking regex is the way to go but I am stumped with that.
I did split it on the li
full_contact = page.css("ul li")[22]
split_contact_on_li = full_contact.to_s.split(/(\W|^)li(\W|$)/).map(&:to_sym)
puts split_contact_on_li
and I get this
<
>
Physical education sed<span class="meta"><ul>
<
>
15184745922<
/
>
</ul></span>
<
/
>
The same number of lines will be shown for each contact_info and the name is always the third line before the span class and the number is always the 6th line.
There is an instance where there might be an email address instead on the 6th line, but not often.
So should I match the second and the third angular bracket and pull the information up to the third and fourth bracket then shove it into an array called name and number?
You shouldn't use a regex to parse XHTML, since regexes cannot reliably handle nested markup; you should use an HTML parser instead. However, if you want to use a regex, you can use one like this:
<li>(.*?)<.*?<li>(.*?)<
Working demo
The idea behind this regex is to use capturing groups (using parentheses) to capture the content you want. So, for your sample input the match information is:
MATCH 1
Group 1. [4-26] `Physical education sed`
Group 2. [53-64] `15184745922`
For example:
#!/usr/bin/env ruby
string = "<li>Physical education sed<span class=\"meta\"><ul><li>15184745922</li></ul></span></li>"
one, two = string.match(/<li>(.*?)<.*?<li>(.*?)</i).captures
p one #=> "Physical education sed"
p two #=> "15184745922"
Why don't you just do a regex on the string "physical education sed15184745922"? You can match on the first digit, and get back the number and the preceding text.
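A minimal sketch of that suggestion in Python rather than Ruby (split_name_and_number is a hypothetical helper, and it assumes the name itself contains no digits). It returns None when the text does not end in digits, which would cover the occasional email-address case:

```python
import re


def split_name_and_number(text):
    """Split 'name' + trailing digits; return None if there are no trailing digits."""
    match = re.fullmatch(r"(\D+?)\s*(\d+)", text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


result = split_name_and_number("Physical education sed15184745922")
print(result)  # ('Physical education sed', '15184745922')
```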
I don't know how to use Ruby, but if I understand your question correctly I would take advantage of the gsub function (or Ruby's equivalent). It might not be the prettiest approach, but since we just want the text in one variable and the numbers in another, we can just replace the characters we don't want with empty values.
v1 = page.css('ul li').text
v2 = gsub('\\d', '', v1)
v3 = gsub('\\D', '', v1)
v1 gets the full text value, v2 replaces all numeric characters with '', and v3 replaces all non-numeric characters with '', giving us two new variables to put wherever we please.
Again, I don't know how to use Ruby, but in R I know that I could get all of the values from the page using the xpath you provided ("ul li") into a vector and then loop across the vector performing the above steps on each element. I'm not sure if that adequately answers your question, but hopefully the gsub function gets you closer to what you want.
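The same two substitutions, sketched in Python with re.sub (the helper names are made up here). Note the assumption baked into this approach: any digit inside the name, or any letter inside the number, ends up in the wrong variable.

```python
import re


def without_digits(text):
    """Drop every digit, leaving the name part."""
    return re.sub(r"\d", "", text)


def only_digits(text):
    """Drop every non-digit, leaving the number part."""
    return re.sub(r"\D", "", text)


combined = "Physical education sed15184745922"
name = without_digits(combined)
number = only_digits(combined)
print(name, number)
```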
You need to use your HTML parser (Nokogiri) and regular expressions together. First, use Nokogiri to traverse down to the first parent node that contains all the text you need, then regex the text to get what you need.
Also, consider using .xpath instead of .css; it provides much more functionality to search and scrape for just what you want. Given your example, you could do it like so:
page.xpath("//span[@class='meta']/parent::li").map do |i|
i.text.scan(/^([a-z\s]+)(\d+)$/i).flatten
end
#=> [['Physical education sed', '15184745922'], ['the next string', '1234567890'], ...]
And now you have a two-dimensional array you can iterate over and save each pair.
This bit of xpath business: "//span[@class='meta']/parent::li" is doing what .css can't do: returning the parent node that has the text and specific children nodes you want to scrape against.
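For comparison, a parser-only sketch in Python using the standard library's ElementTree, which supports a small XPath subset. This assumes the fragment is well-formed XML (real-world HTML would need an HTML parser); SAMPLE and extract_contacts are invented for the illustration, with a second made-up contact added. Reading the name from the li's own text and the number from the nested li avoids the regex step entirely:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two contacts in the same shape as the question's markup.
SAMPLE = """<ul>
<li>Physical education sed<span class="meta"><ul><li>15184745922</li></ul></span>
</li>
<li>Another name<span class="meta"><ul><li>1234567890</li></ul></span>
</li>
</ul>"""


def extract_contacts(markup):
    root = ET.fromstring(markup)
    contacts = []
    # "/.." steps from each matching span up to its parent <li>.
    for li in root.findall(".//span[@class='meta']/.."):
        name = (li.text or "").strip()
        number = (li.findtext("span/ul/li") or "").strip()
        contacts.append((name, number))
    return contacts


contacts = extract_contacts(SAMPLE)
print(contacts)
```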

Remove first line from HTML Markup Field using RegEx

I have a single text field that contains HTML markup. The system that generates this field content always seems to generate a first line with a non-visible carriage return value in it and I can't seem to prevent it from doing so.
Does anyone know of a way (perhaps using a Regular Expression), to remove that first line from this text field?
I'd prefer to leave all other instances of the carriage return values in the field as is, so if it's a RegEx statement that will just remove the first line of a text field, that would work for me.
Any suggestions most welcomed.
Cheers,
Wayne
Usually a trim method (which typically removes leading and trailing whitespace, including CR) is used for this in many programming languages. You did not state which language you will be doing this in...
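Since the question does not name a language, here is one possible sketch in Python (drop_leading_blank_line is a hypothetical name). It assumes the unwanted first line holds only spaces, tabs and the line break, and it leaves every later carriage return untouched, which a plain trim would also do for interior ones but not for trailing ones:

```python
import re


def drop_leading_blank_line(text):
    """Remove the first line only if it is blank (optional spaces/tabs + CR/LF)."""
    # \A anchors at the very start of the string, so later blank lines survive.
    return re.sub(r"\A[ \t]*\r?\n", "", text)


cleaned = drop_leading_blank_line("\r\n<p>Hi</p>\r\n<p>Bye</p>\r\n")
print(repr(cleaned))
```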

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
    for each y_match in x_match.group(1).value.match_all(y_re)
        print y_match.group(1).value
    next y_match
next x_match
Pseudo-output:
a
b
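A runnable take on the pseudo-code above, sketched in Python with the same two patterns (y_values_inside_x is a made-up name, and re.DOTALL is an added assumption so that the "..." parts may span lines). Like the pseudo-code, it relies on the x elements not being nested:

```python
import re

X_RE = re.compile(r"<x[^>]*>(.*?)</x>", re.DOTALL)
Y_RE = re.compile(r"<y[^>]*>(.*?)</y>", re.DOTALL)


def y_values_inside_x(text):
    """Collect the contents of every <y> that sits inside some <x>."""
    values = []
    for x_match in X_RE.finditer(text):
        # Search only the text captured between <x> and </x>.
        for y_match in Y_RE.finditer(x_match.group(1)):
            values.append(y_match.group(1))
    return values


values = y_values_inside_x("<y>c</y><x>...<y>a</y>...<y>b</y>...</x>")
print(values)  # ['a', 'b']
```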
Further clarification in the comments revealed that there is an arbitrary amount of Y elements within any X element. This means there can be no single regex that matches them and extracts their contents.
Short and simple: Use XPath :)
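To make the XPath suggestion concrete, here is a small sketch with Python's ElementTree, assuming the input is well-formed XML (the doc wrapper element and the sample string are invented for the example):

```python
import xml.etree.ElementTree as ET

SAMPLE = "<doc><y>c</y><x>...<y>a</y>...<y>b</y>...</x><y>d</y></doc>"

root = ET.fromstring(SAMPLE)
# ".//x//y" selects y elements at any depth below any x element.
values = [y.text for y in root.findall(".//x//y")]
print(values)  # ['a', 'b']
```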
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
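import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class YInsideX {
    public static void main(String[] args) {
        String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
        // Lookahead: a </x> lies ahead with no <x> or </x> tag before it.
        String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
        Matcher m = Pattern.compile(regex).matcher(str);
        while (m.find()) {
            System.out.println(m.group(1)); // prints a, then b
        }
    }
}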
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.
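As an illustration of the "tailored to the language" point, here is roughly the same single-regex idea in Python. Python's re module only gained possessive quantifiers in version 3.11, so this sketch swaps them for ordinary greedy ones; that should match the same text here, just with more backtracking on failures. It carries the same assumption as the Java version: reasonably well-formed input with no nested x elements.

```python
import re

# Plain greedy quantifiers stand in for the possessive ones of the Java version.
Y_IN_X = re.compile(r"<y[^>]*>(?=(?:[^<]|<(?!/?x\b))*</x>)(.*?)</y>")

text = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>"
# findall returns the single capture group for each match, all at once.
values = Y_IN_X.findall(text)
print(values)  # ['a', 'b']
```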

Variable order regex syntax

Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:
<a href="home.php" class="link" title="Home">Home</a>
<a href="home.php" title="Home" class="link">Home</a>
How would I write a match that checks the class and title and works for both cases? I'm mainly looking for the syntax that allows me to check in any order, not just matching the class and title as I can do that. Is there any way besides just including both combinations and connecting them with a '|'?
Edit: My preference would be to do it in a single regex as I'm building it programmatically and also unit testing it.
No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you 5! = 120 different orderings to check.
On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old fashioned approach of using an XML processing library?
If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
Have you considered xpath? (where attribute order doesn't matter)
//a[@class and @title]
Will select both <a> nodes as valid matches. The only caveat being that the input must be xhtml (well formed xml).
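A quick way to try this idea, sketched with Python's ElementTree on a made-up, well-formed sample. ElementTree only implements a subset of XPath without the "and" operator, so the sketch chains two predicates instead, which amounts to the same test; a full XPath engine would accept the expression as written above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
  <a href="home.php" class="link" title="Home">Home</a>
  <a href="home.php" title="Home" class="link">Home</a>
  <a href="other.php" class="link">Other</a>
</div>"""

root = ET.fromstring(SAMPLE)
# Chained predicates: both attribute tests must hold, in whatever order they were written.
matches = root.findall(".//a[@class='link'][@title='Home']")
print(len(matches))  # 2
```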
You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be
<a\b[^<>]*>
If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookahead for each of the attributes you want to match:
(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")
The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensures that we're matching a whole attribute name. Combining them we get:
<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>
Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
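Because the asker builds the regex programmatically, the lookahead approach can be sketched as a small builder in Python (tag_regex and its parameters are hypothetical names, and the sketch keeps the same simplifying assumptions: double-quoted values, no whitespace around the equals sign). It matches only the opening tag:

```python
import re


def tag_regex(tag, required_attrs):
    """Build a pattern for an opening tag that has all required attributes, in any order."""
    lookaheads = "".join(
        rf'(?=[^<>]*\s+{re.escape(name)}="{re.escape(value)}")'
        for name, value in required_attrs.items()
    )
    return re.compile(rf"<{re.escape(tag)}\b{lookaheads}[^<>]*>")


pattern = tag_regex("a", {"class": "link", "title": "Home"})
print(bool(pattern.search('<a href="home.php" title="Home" class="link">Home</a>')))
```

Adding another required attribute only adds one more lookahead, rather than multiplying the number of alternatives.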
You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups doing whatever tests that you need.
Something like this (untested, using .net regex syntax with the \w for word characters and \s for whitespace):
<a ((?<key>\w+)\s?=\s?['"](?<value>\w+)['"])+ />
The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated, and probably a lot more error prone.
With a single regex you would need something like
<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>
Which is just a first hand guess without checking to see if it's even valid. Much easier to just divide and conquer the problem.
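The divide-and-conquer route might look like this in Python: one pattern finds the opening tag, a second pulls out its attributes into a dictionary, and the checks are then ordinary code. The names are made up, and the attribute pattern is a simplification that assumes quoted values:

```python
import re

TAG_RE = re.compile(r"<a\b([^<>]*)>")
# name = "value" or name = 'value'; \2 requires the same closing quote.
ATTR_RE = re.compile(r"""([\w-]+)\s*=\s*(["'])(.*?)\2""")


def anchor_attributes(html):
    """Return one dict of attributes per <a ...> opening tag found."""
    results = []
    for tag in TAG_RE.finditer(html):
        attrs = {m.group(1): m.group(3) for m in ATTR_RE.finditer(tag.group(1))}
        results.append(attrs)
    return results


found = anchor_attributes('<a href="home.php" title="Home" class="link">Home</a>')
print(found)
```

Checking found[0].get("class") == "link" and found[0].get("title") == "Home" then works regardless of attribute order.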
A first ad hoc solution might be to do the following.
((class|title)="[^"]*?" *)+
This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solvable with assertions. But if you just want to extract the attributes this might already be sufficient.
If you want to match a permutation of a set of elements, you could use a combination of back references and zero-width negative lookahead.
Say you want to match any one of these six lines:
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB
You can do this with the following regex:
/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/
The back references (\1, \2) let you refer to your previous matches, and the zero-width negative lookahead ((?!...)) lets you negate a positional match, saying don't match if the contained pattern matches at this position. Combining the two makes sure that your match is a legitimate permutation of the given elements, with each possibility occurring only once.
So, for example, in ruby:
input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES
# outputs only the permutations (String is not Enumerable since Ruby 1.9, so grep the lines)
puts input.lines.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)
For a permutation of five elements, it would be:
/1-(abc|def|ghi|jkl|mno)-
2-(?!\1)(abc|def|ghi|jkl|mno)-
3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
For your example, the regex would be
/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/
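Since the pattern grows mechanically with the number of elements, it can be generated. Below is a sketch in Python rather than Ruby (permutation_regex is a hypothetical helper), which builds the back-reference and negative-lookahead chain for any list of options. One caveat to keep in mind: the guard compares text at the current position, so it would wrongly reject an option that merely starts with an earlier capture (for example "ab" followed by "abc").

```python
import re
from itertools import product


def permutation_regex(options, separator):
    """Build a pattern matching each option exactly once, in any order."""
    alternatives = "|".join(re.escape(option) for option in options)
    parts = []
    for index in range(len(options)):
        # Forbid repeating anything captured by groups 1..index.
        earlier = "|".join(f"\\{n}" for n in range(1, index + 1))
        guard = f"(?!{earlier})" if earlier else ""
        parts.append(f"{guard}({alternatives})")
    return re.escape(separator).join(parts)


options = ["abc", "def", "ghi"]
pattern = re.compile(permutation_regex(options, "-"))
candidates = ["-".join(combo) for combo in product(options, repeat=3)]
matches = [c for c in candidates if pattern.fullmatch(c)]
print(len(matches))  # 6 of the 27 combinations are true permutations

# The same builder applied to the attributes from the question.
attr_pattern = re.compile(
    '<a href="home\\.php" '
    + permutation_regex(['class="link"', 'title="Home"'], " ")
    + ">Home</a>"
)
```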