HTML text input plugin for CSV data preserving column alignment? - html

When pasting CSV data into an HTML text area, or into any of the jQuery rich text editors I could find, the data is visually "messed up": column alignment is lost, in particular when a cell is long and the one below it is very short.
Is anyone aware of a plugin similar to a text area that would visually preserve column alignment when CSV data is pasted into it? That would require interpreting the tabs as column separators, not just as a fixed number of spaces.
Thanks!

Write a script that replaces the tabs with cell tags and the newlines with row tags, like this:
r='foo\tbar\tfoobar\nbar\tfoo\tfoobar\nfoobar\tbar\tfoo';
r = r.replace(/\n/g,'</td></tr><tr><td>');
r = r.replace(/\t/g, '</td><td>');
r = '<table><tr><td>' + r + '</td></tr></table>';
alert(r);
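A minimal sketch of the same idea in Python, assuming the pasted data is tab-separated with one row per line. The function name `tsv_to_html_table` is hypothetical. Unlike the snippet above, it escapes the cell text, so characters such as `<` or `&` in the data do not break the markup.

```python
import html


def tsv_to_html_table(text: str) -> str:
    """Convert tab-separated text into an HTML table (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # Each tab starts a new cell; escape the cell text for safe embedding.
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in line.split("\t"))
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


sample = "foo\tbar\tfoobar\nbar\tfoo\tfoobar\nfoobar\tbar\tfoo"
print(tsv_to_html_table(sample))
```

The browser then lays out the columns itself, which is what keeps a long cell aligned with a short one below it.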

Parse txt eliminate header and footer Apache-Hop

I'm trying Apache Hop Desktop and I'm stuck on a pipeline where I have to clean and parse this one-column txt file: first eliminating the header and footer, then parsing according to position and length.
name: Initial_Pos: Length:
N1 2 10
N2 12 24
N3 108 30
I'm trying to use 'Text File Input' as the transform, but no luck. Any suggestions?
Thanks in advance
The text file input transform has a wide variety of options you can use.
Under the content tab you can set the file type to fixed width, set the number of header and footer lines, and configure special things like the encoding.
Then under the fields tab you can set the position and length of each field it has to read.
After a bit of fiddling you should be able to extract the information you need.
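To make the header/footer and position/length settings concrete, here is a rough Python sketch of what such a fixed-width read does conceptually. It is not Hop code. The field list comes from the question; the function name, the single header and footer line, and the reading of the start positions as 1-based column numbers are all assumptions.

```python
from typing import Dict, List, Tuple

# (name, start position, length), taken from the question.
# Assumption: start positions are 1-based column numbers.
FIELDS: List[Tuple[str, int, int]] = [("N1", 2, 10), ("N2", 12, 24), ("N3", 108, 30)]


def parse_fixed_width(lines: List[str], header: int = 1, footer: int = 1) -> List[Dict[str, str]]:
    """Drop header/footer lines, then slice each remaining line by position and length."""
    body = lines[header:len(lines) - footer]
    records = []
    for line in body:
        record = {}
        for name, start, length in FIELDS:
            # Convert the 1-based start to a 0-based slice and trim the padding.
            record[name] = line[start - 1:start - 1 + length].strip()
        records.append(record)
    return records


# Build one data line that matches the layout, wrapped in a header and a footer.
row = " " + "alpha".ljust(10) + "beta".ljust(24) + " " * 72 + "gamma".ljust(30)
print(parse_fixed_width(["HEADER", row, "FOOTER"]))
```

If the values come out shifted by one character, the positions in your tool are probably 0-based rather than 1-based.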

Word html format: insert a custom TOC via field code

I am generating Word docs from html. Basically, I build a file with html and save it as a .doc. Then I open it in Word and apply a template. All good so far.
I would like to automatically generate a custom TOC via the HTML, i.e. when I am building the document. I need to insert a field code to do that, in the same way I do to add page numbering via the HTML, e.g.:
<span style="mso-field-code: PAGE " class="page-field"></span>
If I save my HTML doc as docx and apply a template, I can make a TOC based on the styles, the way one would normally create a TOC in Word. I customised the TOC so the Title style is the top level, followed by H1, H2, then H3. If I then toggle the field code on the TOC, the field code looks like this:
{ TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1" }
Now, I can add HTML like this to insert the TOC:
<div style="mso-field-code: TOC " class="toc-field">TOC goes HERE</div>
When I do that, if I right click the text "TOC goes HERE" I get the option to "Update field" and if I do that a TOC is generated using the default H1,H2,H3 tags.
But, what I can't work out is how to include the
\t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
part so my custom style sequence is applied. I have tried all sorts of combinations and it seems that adding anything after TOC causes Word to not make a field code.
Does anyone have any suggestions?
Update:
Based on the essential help from @slightlysnarky below, I thought I would summarise the outcome here, because the information I needed was in a Microsoft .chm file that was taken down many years ago. If you read the following extract from that help manual and compare it to the solution below, you will see how this all works.
Word marks and stores information for simple fields by means of the Span element with the mso-field-code style. The mso-field-code value represents the string value of the field code. Formatting in the original field code might be lost when saving as HTML if only the string value of the code is necessary for its calculation.
Word has a different way of storing field information to HTML for more complex fields, such as ones that have formatted text or long values. Word marks these fields with <!--[if supportFields]> so the data is not displayed in the browser. Word uses the Span element with the mso-element: field-begin, mso-element: field-separator, and mso-element: field-end attributes to contain the three respective parts of the field code: the field start, the separator between field code and field results, and the field end. Whenever possible, Word will save the field to HTML in the method that uses the least file space.
So, basically, add tags as shown below to your HTML at the point you wish the TOC to appear.
:-)
Word recognises a "complex field format" in HTML, along the same lines as it does in the Office Open XML format. So you can use
<span style='mso-element:field-begin'></span>TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
<span style='mso-element:field-separator'></span>This text will show but the user will need to update the field
<span style='mso-element:field-end'></span>
This construct is outlined in a Microsoft document called "Microsoft Office HTML and XML Reference". It's a Windows .exe that unpacks to a .chm Help file. You can get it here
The info. on encoding fields is in Getting Started with Microsoft Office 2000 HTML and XML->Microsoft Word->Fields
There may be a later version but that's the only one I could find.
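Since the document is being built programmatically anyway, the three-part construct can be generated by a small helper. The following is only a sketch in Python: the function name `word_field_html` and the placeholder text are hypothetical, and the markup is simply the span sequence shown above.

```python
def word_field_html(field_code: str, placeholder: str) -> str:
    """Build the begin / separator / end span sequence for a complex field (hypothetical helper)."""
    return (
        "<span style='mso-element:field-begin'></span>"
        f"{field_code}"
        "<span style='mso-element:field-separator'></span>"
        f"{placeholder}"
        "<span style='mso-element:field-end'></span>"
    )


# A raw string keeps "\t" as a literal backslash followed by t, not a tab character.
toc = word_field_html(
    r'TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"',
    "Right-click and choose Update field to build the TOC",
)
print(toc)
```

The same helper should work for other field codes, but only the TOC case is confirmed in this thread.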

Use R to extract sections of HTML document using <b> to indicate section header

I have a few thousand large documents saved locally, where they are all saved as HTML files. Each document is about 300 pages long, and has some sections that have titles in bold letters. My goal is to do a text search in these files, and when I find the given phrase, extract the whole section that contains this phrase. My idea was to parse the html text so that it becomes a list of paragraphs, find the location of the phrase, and then extract everything from the bold letters (title of this section) just prior to bold letters just after (title of the next section).
I tried a number of different ways, but none of them does what I want. The following was promising:
myhtmlfile = "I:/myfolder/myfile.html"
myhtmltxt2 = htmlTreeParse(myhtmlfile, useInternal = TRUE)
But while I can display the object "myhtmltxt2" and it looks like HTML with tags (which is what I need so that I can look for "<b>"), it is an external pointer. So I am not able to run the command below, because grep does not work on pointers.
test2<-grep("myphrase",myhtmltxt2,ignore.case = T)
Alternatively, I did this:
doc.text = unlist(xpathApply(myhtmltxt2, '//p', xmlValue))
test3<-grep("myphrase",doc.text,ignore.case = T)
But in this case, I lost html tags in doc.text, so I no longer have "<b>" which is what I was going to use to indicate section to extract. Is there a way of doing this?
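I managed this with the following:
# read the whole file into a single string
singleString <- paste(readLines(myhtmlfile), collapse=" ")
# split at every bold paragraph start, i.e. at each section title
data11 = strsplit(singleString,"<p><b>", fixed = TRUE)
test2<- unlist(data11)
# position of the phrase in each section (-1 where it does not occur)
myindex<-regexpr("Myphrase </b>", test2)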
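The same split-on-title approach can be sketched in Python for comparison. This assumes, as the R code above does, that every section title starts with the literal text `<p><b>`. The function name `sections_containing` is hypothetical, and the search is case-insensitive to mirror the `ignore.case` attempts in the question.

```python
from typing import List


def sections_containing(html_text: str, phrase: str, marker: str = "<p><b>") -> List[str]:
    """Split raw HTML at each bold paragraph start and keep the sections that mention the phrase."""
    # The text before the first marker is not a titled section, so skip it.
    chunks = html_text.split(marker)[1:]
    phrase_lower = phrase.lower()
    # Restore the marker so each returned section begins with its bold title.
    return [marker + chunk for chunk in chunks if phrase_lower in chunk.lower()]


doc = (
    "<html><p>intro</p>"
    "<p><b>One</b></p><p>apples</p>"
    "<p><b>Two</b></p><p>My Phrase here</p></html>"
)
print(sections_containing(doc, "my phrase"))
```

Working on the raw string keeps the tags, which is exactly what was lost when extracting text with `xmlValue`.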

CSV Hebrew text not in the correct order

I am trying to import a CSV file so that my PHP project can insert its data into a MySQL database. The CSV file contains one column with Hebrew characters. It was converted from an XLS file.
When I open the file with Excel, the text is displayed correctly.
But when I try to use the CSV file, the order is wrong:
פרקט תלת שכבתי אלון 189x15/4 גרי ישן מעושן גימור שמן UV
Does anybody know how to resolve this problem? Thanks!
The problem is not in my PHP script, which works fine. The problem is that the Excel cell format does not carry over when I use the data as CSV.
The problem is when an English word or number is mixed in with the text:
Example:
English:
“Can we improve the health of patients by giving them Aspirin?”
Hebrew:
“[Hebrew translated text] Aspirin?”
This is displayed as:
Aspirin [Hebrew translated text]?
Hopefully I explained the issue enough. It is a little confusing so if I need clarify more, please let me know.
Any help or experience is appreciated.
As an RTL language speaker, I think I can be of help.
It all depends on the text direction the UI is using. Most applications use LTR (left-to-right) text direction by default. If you are using MySQL Workbench to view the values stored in the column, it uses LTR direction as well. That's why you see the wrong order when the text is bidirectional (Hebrew mixed with Latin words or numbers).
Keep in mind that a CSV file is merely plain text (UTF-8 here), which means the text carries no style and no direction. You only need to set the direction of your HTML to RTL. See the example below:
<h3>Wrong LTR Direction</h3>
<p dir="ltr">פרקט תלת שכבתי אלון 189x15/4 גרי ישן מעושן גימור שמן UV</p>
<h3>Correct RTL Direction</h3>
<p dir="rtl">פרקט תלת שכבתי אלון 189x15/4 גרי ישן מעושן גימור שמן UV</p>
Salam :)
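If the page is generated by a script, the `dir` attribute can be chosen per value instead of being hard-coded. Below is a minimal sketch in Python (the asker uses PHP, so treat it as an illustration of the idea only). Both function names are hypothetical, and picking the direction from the first strongly directional character is a heuristic, not something the answer above prescribes.

```python
import html
import unicodedata


def guess_direction(text: str) -> str:
    """Return 'rtl' or 'ltr' based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew letters, Arabic letters
            return "rtl"
        if bidi == "L":  # Latin and other left-to-right letters
            return "ltr"
    return "ltr"  # only digits, punctuation, or empty text: fall back to LTR


def paragraph(text: str) -> str:
    """Wrap a CSV cell value in a <p> element with an explicit dir attribute."""
    return f'<p dir="{guess_direction(text)}">{html.escape(text)}</p>'


print(paragraph("פרקט תלת שכבתי אלון 189x15/4 גרי ישן מעושן גימור שמן UV"))
```

The stored string is not modified at all; only the direction in which the browser lays it out changes.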

R, Regex, and Matching the Choice of a Qualtrics Response Column

When you export response data from Qualtrics as a CSV, the 2nd row of the data contains strings with the question stem (shortened if necessary), followed by a dash, followed by that response column's corresponding choice. As an example, if my question were "Please select all of the fruit you enjoy:", in my response data the second row of a response column to this question might contain something like "Please select all of the fruit you enjoy:-Blueberries".
Qualtrics shortens the question stem if it is longer than 100 characters. If it is more than 100 characters, the stem is cut off after the 99th character, "..." is appended, and then the dash, and then the choice text.
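To pin down that rule, here is a small Python sketch that rebuilds a second-row entry from a stem and a choice. It is based only on the behaviour described in this post, not on Qualtrics documentation, and the function name `second_row_label` is hypothetical.

```python
def second_row_label(stem: str, choice: str, limit: int = 100) -> str:
    """Rebuild the label as described above: stem (shortened if needed), dash, choice."""
    if len(stem) > limit:
        # Keep the first 99 characters and append "..." as described in the text.
        stem = stem[:limit - 1] + "..."
    return f"{stem}-{choice}"


print(second_row_label("Please select all of the fruit you enjoy:", "Blueberries"))
```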
I am trying to retrieve the text that is after this dash. However, that's difficult, because both the choice text and the question text could contain dashes. I have thought of two different approaches I could take in attempting to select just the choice text:
1. I have the question text, and can reliably retrieve it programmatically based on the response column name. However, the question text doesn't always match exactly, because Qualtrics removes any HTML styling from the question text in the response data, but not in the Qualtrics survey file (QSF) that I am getting the question text from. For questions that don't have any HTML styling, I was thinking about using the question text to match up to and including the dash between the question text and the choice text. I think regex could handle this case fine, but this clearly doesn't work without heavy modification for any questions that have HTML components.
2. The alternative, which I think might be more reliable: strip any HTML tags from the question text in the QSF file, then count how many "-" characters appear in it. Call that n, then match the second-row entry up to the (n+1)th dash, remove that part, and what remains is my choice text.
I think the 2nd option is much more likely to work consistently, since the first option leaves me with a case where I have to try and strip html from the question text in exactly the same way Qualtrics does, unless I use fuzzy matching (which I know nothing about). However, the second option is also unclear to me.
an example csv response set
For example, the first question's question text looks like this in the QSF:
"<div style=\"text-align: center;\">Click to write the question text
<span style=\"font-size: 10.8333px;\">thsi<sup>tasdf<em>werasfd</em></sup>
<em>sdfad</em></span><br />\n </div>"
I would appreciate both of the following: advice on which option (or a suggestion for another) you think has the most chance for success, and help with the regex in R for matching the text up to the n+1th "-" character.
Here's a solution that counts the dashes in the question, locates the nth dash in the text (if any) and drops the preceding characters, and then keeps the substring that follows the next dash in the text.
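stem_text <- "Please--select your extracurriculars"
s <- "<em>Please</em>--select your extracurriculars-student-athletics"
# count dashes in question stem (gregexpr returns -1 when there is no match)
stem_dash_n <- sum(gregexpr("-", stem_text, fixed = TRUE)[[1]] > 0)
# locate dashes in string
s_dashes <- gregexpr("-", s, fixed = TRUE)[[1]]
# position of the stem's last dash in the string, or 0 if the stem has no dash
sub_start <- if (stem_dash_n > 0) s_dashes[stem_dash_n] else 0
s_sub <- substr(s, sub_start + 1, nchar(s))
# drop everything up to and including the next dash (the stem/choice separator)
sub("[^\\-]*\\-(.*)", "\\1", s_sub, perl = TRUE)
# [1] "student-athletics"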
Assumptions: based on your description, length(s_dashes) >= stem_dash_n, so s_dashes[stem_dash_n] exists; the same number of dashes appear in the known stems and their representations in the text; and there is always a dash separating the stem and response choice.
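For comparison, here is the same count-the-dashes idea as a Python sketch. The function names are hypothetical, and the regex-based tag stripping is a crude stand-in that may not match how Qualtrics removes HTML. Stripping matters because style attributes such as `text-align` or `font-size` in the QSF text contain dashes that must not be counted. Stems shortened with "..." are not handled here.

```python
import re


def strip_tags(html_text: str) -> str:
    """Crude tag removal; an assumption, not necessarily how Qualtrics strips HTML."""
    return re.sub(r"<[^>]*>", "", html_text)


def choice_text(stem_html: str, label: str) -> str:
    """Return the text after the (n+1)th dash, where n is the number of dashes in the stem."""
    n = strip_tags(stem_html).count("-")
    # Splitting at most n+1 times leaves everything after the (n+1)th dash intact.
    parts = label.split("-", n + 1)
    if len(parts) < n + 2:
        raise ValueError("label has fewer dashes than expected")
    return parts[-1]


print(choice_text(
    '<em style="font-size: 10px;">Please</em>--select your extracurriculars',
    "Please--select your extracurriculars-student-athletics",
))
```

Raising an error when the label has too few dashes makes a violated assumption visible instead of silently returning the wrong text.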