I would like to access the text in a TextArea as a list of words, e.g. Word(2). Is there native support for that in TextArea(), or a good List object I could use?
E.g.
1stword = TextArea.TextAsList(1)
2ndword = TextArea.TextAsList(2)
Since there is already .htmltext, is there some HTML object that could be used to make such a list easily?
Nope. Just get the text in there and use a String.Split.
Besides, if there were such a method, it would also do the split.
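To make the "just split it" advice concrete, here is a minimal sketch in Python (used only for illustration; the question is about a TextArea in another environment). The names text_area_content and split_words are hypothetical, and the sketch assumes that words are separated by whitespace.

```python
def split_words(text):
    """Return the whitespace-separated words of text as a list."""
    # str.split() with no argument collapses runs of spaces, tabs and newlines.
    return text.split()


# Hypothetical stand-in for whatever the TextArea's text property returns.
text_area_content = "The quick  brown\nfox"

word_list = split_words(text_area_content)
first_word = word_list[0]
second_word = word_list[1]
```

Note that the list is zero-based here, so the "2nd word" lives at index 1.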
Is it possible to build an XSD that will treat any tag's contents just as text? I am trying to extract a tag's contents that sometimes contain HTML tags. There is no fixed pattern to the HTML, and it is not always present. I just want to extract all the text from within the tags, e.g. <content>this is a new piece of content by <b>Person A</b></content>. I want to extract just "this is a new piece of content by <b>Person A</b>", but the schema generated by SSIS naturally includes these tags. When I just add a simple entry
<xs:element minOccurs="0" name="content" type="xs:string"></xs:element>
I get the following error which is not unexpected.
[XML Source [5]] Error: The XML Source was unable to process the XML
data. The element "content" cannot contain a child element. Content
model is text only.
You're not distinguishing very clearly between the schema you are writing to describe and constrain your data (and, I assume, guide SSIS in various ways) and the executable code you will at some point want to write in order to extract the data you want at a particular moment. There are several things you seem to want or need:
To allow unconstrained XML within an element, you'll want a wildcard; read up on the xsd:any element.
To extract just the text within an element, you'll want the XPath string() function (but note that your example "this is a new piece of content by <b>Person A</b>" is not just the text of content but contains a child element).
To extract a serialized XML representation of the content of the content element (which is what you apparently want, in contrast to what you say you want), you'll want to serialize the contents; there are a variety of ways to do that.
Think of the XSD primarily as describing allowed markup in a valid XML document rather than as a method to define extraction. If you change the type of content to xs:string, you're declaring that markup is not permitted within content, only text, and the validation error you're getting reflects that.
What you want is to select the string value of the content element. If the context for an XPath doesn't automatically convert its results to a string value, you can do so explicitly via the string() XPath function:
string(/path/to/particular/content)
This will return the concatenation of the string values of all of the children of content, omitting the tags as requested.
Update: Re-reading your question, I see that you actually want to retrieve
"this is a new piece of content by <b>Person A</b>"
(including the b element, not its string value). Here, the wrapping content element clearly has to be described in the XSD as having mixed content (mixed="true"). Extracting this data from an XML document in this form would typically involve selecting a collection of text and element nodes, and serializing these back to a single string. I am not familiar enough with SSIS to provide details, but perhaps the reference I mentioned in the comments could help.
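The difference between the string value and the serialized content can be sketched outside SSIS. The following is only an illustration in Python's standard xml.etree.ElementTree module, not something SSIS provides; string_value and inner_xml are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def string_value(element):
    """Concatenate all descendant text, like XPath string(): the tags are dropped."""
    return "".join(element.itertext())


def inner_xml(element):
    """Serialize the children of element back to markup: the tags are kept."""
    parts = [element.text or ""]
    # ET.tostring() on a child includes the text that follows it (its tail).
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)


content = ET.fromstring(
    "<content>this is a new piece of content by <b>Person A</b></content>"
)
text_only = string_value(content)
with_markup = inner_xml(content)
```

Here text_only corresponds to what string() returns, while with_markup is the mixed content serialized back to a single string.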
I want to be able to apply non-style attributes to sections of text in a TextField. For example characters 30-45 will be set to animate in a certain direction.
As this field is editable characters 30-45 may no longer be at 30-45 if the text is edited in any way.
Can anyone think of an elegant way to keep track of which characters had the attributes applied to them?
I've had a similar project and ended up extending the TextField class to fit my needs. Here's a short description of what to do (my actual code is confidential, I'm afraid):
1. Override the setters for text and htmlText.
2. Parse any content from these setters into an array of custom objects. Each of these objects contains a raw text chunk and the metadata that applies to it (format, comments, etc.).
For example,
<span class="sometext" animation="true">Info</span>
would be translated to an object like this:
{ text:"Info", clazz:"sometext", animation:true };
3. The actual text output is then rendered by using appendText to add the raw text chunk by chunk, and using setTextFormat to apply formatting (or do whatever else is necessary) after each append step.
4. Add event listeners to react to TEXT_INPUT and/or KEY_DOWN/KEY_UP events to catch any new user input. (You will replace the entire text content of your TextField over and over again, so it's not an option to use super.text.)
5. User input is processed by using selectionBeginIndex and selectionEndIndex (count the number of characters in the raw text of your object array to find out which chunks are affected). Add or replace the new text directly within the container objects, then use step 3 to refresh the entire text in the TextField.
I have also added a method that reduces the array before it is rendered (i.e. combines adjacent chunks with identical metadata). This keeps the array lean and helps create XML output that does not have a complicated tree structure (one-dimensional is what we want for this kind of scenario).
6. Override the getters for text and htmlText to return the newly formatted info, if you need the results somewhere else. I've used htmlText to return a fully decorated XML string and kept text for accessing the raw text content, just like in a generic TextField.
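The chunk bookkeeping described above can be sketched independently of Flash. This is a rough Python illustration, not the confidential code mentioned in the answer; Chunk, merge_adjacent, replace_range and ranges_with are all hypothetical names. It assumes that inserted text inherits the metadata of the chunk in which the edit begins.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    meta: dict = field(default_factory=dict)


def merge_adjacent(chunks):
    """Drop empty chunks and combine neighbours that carry identical metadata."""
    merged = []
    for chunk in chunks:
        if not chunk.text:
            continue
        if merged and merged[-1].meta == chunk.meta:
            merged[-1] = Chunk(merged[-1].text + chunk.text, merged[-1].meta)
        else:
            merged.append(Chunk(chunk.text, dict(chunk.meta)))
    return merged


def replace_range(chunks, begin, end, new_text):
    """Replace raw-text characters [begin, end) with new_text."""
    result, offset, inserted = [], 0, False
    for chunk in chunks:
        length = len(chunk.text)
        # Clamp the edit boundaries to this chunk's local coordinates.
        head_end = min(max(begin - offset, 0), length)
        tail_start = min(max(end - offset, 0), length)
        result.append(Chunk(chunk.text[:head_end], chunk.meta))
        if not inserted and begin <= offset + length:
            # Assumption: new text takes the metadata of the chunk where the edit starts.
            result.append(Chunk(new_text, chunk.meta))
            inserted = True
        result.append(Chunk(chunk.text[tail_start:], chunk.meta))
        offset += length
    if not inserted:
        result.append(Chunk(new_text, {}))
    return merge_adjacent(result)


def ranges_with(chunks, key):
    """Return (start, end) character ranges of chunks whose metadata sets key."""
    spans, offset = [], 0
    for chunk in chunks:
        if chunk.meta.get(key):
            spans.append((offset, offset + len(chunk.text)))
        offset += len(chunk.text)
    return spans


chunks = [Chunk("Hello "), Chunk("world", {"animation": True}), Chunk("!")]
before = ranges_with(chunks, "animation")
edited = replace_range(chunks, 6, 6, "big ")  # the user types "big " at index 6
after = ranges_with(edited, "animation")
```

Because the attribute lives on the chunk rather than on fixed character indices, the animated range moves along with the text when something is typed in front of it.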
I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything enclosed in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes, script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here are some examples of markup that look like they'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend that you use DOM (Document Object Model) to parse HTML documents and extract their content. A DOM library exists for most programming languages.
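For comparison, a parser-based extraction might look like the sketch below. It uses Python's standard html.parser module, which is an event-based parser rather than a full DOM, so it only stands in for the DOM approach recommended above. VisibleTextExtractor and extract_visible_text are hypothetical names, and the sketch assumes that text inside script and style elements should be skipped.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; the parser routes them to handle_comment().
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    '<html><head><style>p { color: red; }</style>'
    '<script>var s = "<b>x</b>";</script></head>'
    '<body><!-- note --><p title="a > b">Hello <b>world</b></p></body></html>'
)
visible = extract_visible_text(sample)
```

The sample deliberately contains a > inside a quoted attribute value and markup inside a script string, two of the cases raised in this thread.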
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    ' Only the "text" group holds visible page text; tags, comments and scripts pass through untouched.
    Dim plainText As String = match.Groups("text").Value
    If plainText <> "" Then
        Return plainText.Replace("Original word", "Replacement word")
    End If
    Return match.Value
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
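For readers not on .NET, roughly the same technique can be sketched in Python with re.sub and a callback. This is an illustration under the same caveats raised elsewhere in this thread, not a robust HTML parser; replace_in_text is a hypothetical name, and the text group uses + instead of * so that it never matches an empty string.

```python
import re

MARKUP_OR_TEXT = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # whole script/style elements
    r"|(?:<!--[\s\S]*?-->)"                            # comments
    r"|(?:<[\s\S]*?>)"                                 # any other tag
    r"|(?P<text>[^<>]+)",                              # leftover plain text
    re.IGNORECASE,
)


def replace_in_text(html, old, new):
    """Replace old with new in visible text only, leaving markup untouched."""

    def evaluate(match):
        text = match.group("text")
        if text is not None:
            return text.replace(old, new)
        return match.group(0)

    return MARKUP_OR_TEXT.sub(evaluate, html)


source = (
    '<p class="Original word">Original word</p>'
    '<script>var x = "Original word";</script>'
)
new_html = replace_in_text(source, "Original word", "Replacement word")
```

Only the text between the p tags changes; the attribute value and the script string keep the original wording.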
For your information,
instead of regex, it is possible to extract the text alone from HTML markup with jQuery. For that you can use the following pattern, where htmlString holds the markup:
$("<div/>").html(htmlString).text()
You can refer to this JSFiddle.
I have been looking for a way to capture structured text (sections, paragraphs, emphasis, lists, etc.) in JSON, but I haven't found anything yet. Any suggestions? (Markdown crossed my mind, but there might be something better out there.)
How about something like this:
[ { "heading": "Foobar Example" },
{ "paragraph":
[
"This is normal text, followed by... ",
{ "bold": "some bold text" },
"etc."
]
}
]
That is:
use a string for plain text without formatting or other mark-up;
use an array whenever you want to indicate an ordered sequence of certain text elements;
use an object where the key indicates the mark-up and the value the text element to which the formatting is applied.
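To check that these three rules are enough to drive a renderer, here is a small sketch that walks such a structure recursively and emits HTML. The TAGS mapping from mark-up keys to HTML element names is an assumption made for this example, as is the render function name.

```python
import json
from html import escape

# Hypothetical mapping from the mark-up keys used above to HTML element names.
TAGS = {"heading": "h1", "paragraph": "p", "bold": "b"}


def render(node):
    """Render a string / array / object node to HTML, following the three rules."""
    if isinstance(node, str):
        return escape(node)  # plain text without mark-up
    if isinstance(node, list):
        return "".join(render(child) for child in node)  # ordered sequence
    if isinstance(node, dict):
        return "".join(
            f"<{TAGS[key]}>{render(value)}</{TAGS[key]}>"
            for key, value in node.items()
        )
    raise TypeError(f"unsupported node type: {type(node).__name__}")


document = json.loads("""
[ { "heading": "Foobar Example" },
  { "paragraph":
    [
      "This is normal text, followed by... ",
      { "bold": "some bold text" },
      "etc."
    ]
  }
]
""")
html_output = render(document)
```

A key that is missing from TAGS raises a KeyError, which is a deliberate choice in this sketch: unknown mark-up fails loudly instead of being dropped.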
HTML is a well-established way to describe structured text, in a plain-text format(!). Markdown (as you mentioned) would work as well.
My view is that your best bet is probably going to be using some sort of plain-text markup such as those choices, and placing your text in a single JSON string variable. Depending on your application, it may make sense to have an array of sections, containing an array of paragraphs, containing an array of normal/bold/list sections etc. However, in the general case I think good old-fashioned blocks of markup will ironically be cleaner and more scalable, due to the ease of passing them around, and the well-developed libraries for full-blown parsing if/when required.
There also seems to be a specification that might accomplish this: Markdown Syntax for Object Notation (MSON).
Not sure if for you it's worth the trouble of implementing the spec, but it seems to be an option.
What is the best format to communicate an array of strings in one string to users who are not geeks?
I could do it like this:
Item1, Item2, Item3
But that becomes meaningless when the strings contain spaces and commas.
I could also do it this way:
"Item1", "Item2", "Item3"
However, I would like to avoid escaping the array elements because escaped characters can be confusing to the uninitiated.
Edit: I should have clarified that I need the formatted string to be one-line. Basically, I have a list of lists displayed in a .Net Winforms ListView (although this question is language-agnostic). I need to show the users a one-line "snapshot" of the list next to the list's name in the ListView, so they get a general idea of what the list contains.
You can pick a character like the pipe (|), which is not used much outside programs. It is also used in wiki markup for tables, which may make it intuitive to those who are familiar with wiki markup.
Item1| Item2| Item3
In a GUI or color TUI, shade each element individually. In a monochrome TUI, add a couple of spaces and advance to the next tab position (\t) between each word.
Using JSON, the above list would look like:
'["Item1", "Item2", "Item3"]'.
This is unambiguous and a syntax in widespread use. Just explain the nested syntax a little bit and they'll probably get it.
Of course, if this is to be displayed in a UI, then you don't necessarily want unambiguous syntax as much as you want it to actually look like something intended for the end user. In that case it would depend exactly how you are displaying this to the user.
Display each element as a cell in a table.
How about line breaks after each string? :>
Display each string on a separate line, with line numbers:
1. Make a list
2. Check it twice
3. Say something nice
It's the way people write lists in the real world, y'know :)
Use some kind of typographical convention, for example a bold hashmark and space between strings.
milk # eggs # bread # apples # lettuce # carrots
CSV. Because the very first thing your non-technical user is going to do with delimited data is import it into a spreadsheet.
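As a rough illustration of the CSV suggestion, Python's standard csv module quotes an item only when it contains a comma, a quote or a line break, so simple lists stay free of escaping. The helper name to_csv_line is hypothetical.

```python
import csv
import io


def to_csv_line(items):
    """Format items as a single CSV line, quoting only where needed."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(items)
    # writerow() appends a line terminator; drop it for a one-line snapshot.
    return buffer.getvalue().rstrip("\r\n")


items = ["Item1", "Item 2, with comma", 'Say "hi"']
snapshot = to_csv_line(items)
round_trip = next(csv.reader([snapshot]))
```

The line can be read back with csv.reader without loss, which is what makes the format unambiguous even when items contain commas.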