R, Regex, and Matching the Choice of a Qualtrics Response Column - html

When you export response data from Qualtrics as a CSV, the 2nd row of the data contains strings with the question stem (shortened if necessary), followed by a dash, followed by that response column's corresponding choice. As an example, if my question were "Please select all of the fruit you enjoy:", in my response data the second row of a response column to this question might contain something like "Please select all of the fruit you enjoy:-Blueberries".
Qualtrics shortens the question stem if it is longer than 100 characters. If it is more than 100 characters, the stem is cut off after the 99th character, "..." is appended, and then the dash, and then the choice text.
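The truncation rule described above can be written down as a small function, which is handy for checking a guess against real export headers. This is only a sketch in Python of the behaviour as described in this question; `expected_header` is a hypothetical name, and the exact cut-off should be verified against an actual export.

```python
def expected_header(stem: str, choice: str, limit: int = 100) -> str:
    """Build the second-row header text as described above (assumed behaviour)."""
    if len(stem) > limit:
        # Keep the first 99 characters and mark the cut with "..."
        stem = stem[: limit - 1] + "..."
    # The stem and the choice text are joined by a single dash
    return f"{stem}-{choice}"


print(expected_header("Please select all of the fruit you enjoy:", "Blueberries"))
```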
I am trying to retrieve the text that is after this dash. However, that's difficult, because both the choice text and the question text could contain dashes. I have thought of two different approaches I could take in attempting to select just the choice text:
1. I have the question text and can reliably retrieve it programmatically from the response column name. However, it doesn't always match exactly, because Qualtrics strips any HTML styling from the question text in the response data but not in the Qualtrics survey file (QSF) that I get the question text from. For questions without HTML styling, I was thinking of using the question text to match everything up to and including the dash between the question text and the choice text. I think regex could handle this case fine, but it clearly won't work without heavy modification for questions that have HTML components.
2. The alternative, which I think might be more reliable: strip any HTML tags from the question text in the QSF file, then count how many "-" characters appear in it. Call that count n, match the second-row entry up to and including the (n+1)th dash, and remove the match; what remains is my choice text.
I think the 2nd option is much more likely to work consistently, since the first option leaves me having to strip HTML from the question text in exactly the same way Qualtrics does, unless I use fuzzy matching (which I know nothing about). However, how to implement the second option is also unclear to me.
For example, the first question's question text looks like this in the QSF:
"<div style=\"text-align: center;\">Click to write the question text
<span style=\"font-size: 10.8333px;\">thsi<sup>tasdf<em>werasfd</em></sup>
<em>sdfad</em></span><br />\n </div>"
I would appreciate both of the following: advice on which option (or a suggestion for another) you think has the most chance for success, and help with the regex in R for matching the text up to the n+1th "-" character.

Here's a solution that counts the dashes in the question, locates the nth dash in the text (if any) and drops the preceding characters, and then keeps the substring that follows the next dash in the text.
stem_text <- "Please--select your extracurriculars"
s <- "<em>Please</em>--select your extracurriculars-student-athletics"
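# count dashes in the question stem (gregexpr returns -1 when there is no match)
stem_dash_n <- sum(gregexpr("-", stem_text, fixed = TRUE)[[1]] > 0)
# locate dashes in the string
s_dashes <- gregexpr("-", s, fixed = TRUE)[[1]]
# position of the stem's last dash in s, or 0 if the stem has no dashes
sub_start <- if (stem_dash_n > 0) s_dashes[stem_dash_n] else 0
s_sub <- substr(s, sub_start + 1, nchar(s))
# drop everything up to and including the next dash (the stem/choice separator)
sub("[^\\-]*\\-(.*)", "\\1", s_sub, perl = TRUE)
# [1] "student-athletics"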
Assumptions: based on your description, length(s_dashes) >= stem_dash_n, so s_dashes[stem_dash_n] exists; the same number of dashes appear in the known stems and their representations in the text; and there is always a dash separating the stem and response choice.
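For comparison, the same dash-counting idea can be sketched outside R. The Python function below is an illustration under the same assumptions as above, not a tested Qualtrics parser: `choice_text` is a hypothetical name, tags are removed with a simple `<[^>]*>` pattern (which may not match how Qualtrics strips HTML), and stems truncated at 100 characters are not handled.

```python
import re


def choice_text(stem: str, header: str) -> str:
    """Return the choice part of a 'stem-choice' header cell."""
    # Remove HTML tags so dashes inside attributes (e.g. "text-align") are not counted
    plain_stem = re.sub(r"<[^>]*>", "", stem)
    n = plain_stem.count("-")
    # Split on the first n + 1 dashes; whatever is left over is the choice text
    parts = header.split("-", n + 1)
    if len(parts) < n + 2:
        raise ValueError("header has fewer dashes than the stem implies")
    return parts[-1]


print(choice_text("<em>Please</em>--select your extracurriculars",
                  "Please--select your extracurriculars-student-athletics"))
```

Splitting with a maximum count avoids building a regex at all, so dashes inside the choice text are left untouched.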

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming any < or > outside the tags is either absent or entity-encoded, here is an idea using a lookbehind:
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
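As a quick check of this pattern outside Pipelines, here is a small sketch using Python's `re` module on the sample input from the question. The whitespace clean-up at the end is an added assumption about the desired output, and joining pieces with a space would split a word that has an inline tag in the middle of it.

```python
import re

TEXT_PATTERN = re.compile(r"(?:(?<=>)|^)[^<]+")

html = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

# Each match is a run of text that starts the string or follows a closing ">"
pieces = TEXT_PATTERN.findall(html)

# Trim each piece and join with single spaces (assumed clean-up step)
plain = " ".join(piece.strip() for piece in pieces if piece.strip())
print(plain)
```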

R Stargazer table ASCII text output formatting (line break, alignment & reference group)

For better or worse, I don't use LaTeX (yet). I like producing stargazer formatted tables on the fly for class examples in both HTML and in the console. However, I'm having trouble with 3 formatting elements; so far I've found solutions for LaTeX and some in HTML, but the ASCII console text eludes me.
The 3 challenges are:
1. Breaking a line so that a variable name can wrap instead of increasing the table width.
2. Aligning coefficients & std. errors at the decimal, even when there are p-value stars.
3. Making space in the covariate labels & coefficients to allow for a reference group.
Let's start with some reproducible data & outputs to reference.
library(stargazer)
set.seed(3); x1 <- factor(sample(letters[1:4], 1000, replace=TRUE))
set.seed(4); x2 <- runif(1000, -10, 10)
set.seed(5); x3 <- rbinom(1000, size = 1, prob = 0.13)
set.seed(6); y <- runif(1000, -10, 10)
model <- (lm(y ~ x1 + x2 + x3))
stargazer(model, align=TRUE,
#type="html", out="SO_stargazer.html",
type="text", out="SO_stargazer.txt",
title="Example Title Goes Here",
dep.var.caption="",
dep.var.labels="This is my long title for the Dependent Variable Y",
covariate.labels=c("X1 Group B",
"X1 Group C",
"X1 Group D",
"X2 with a super ridiculous and annoyingly long name",
"X3"))
Line break
My default approach is to use \n in the character string. For example, I might try to break the DV caption:
dep.var.labels="This is my long title for \n the Dependent Variable Y",
But that generates the following error message:
Error in if (nchar(text.matrix[r, c]) > max.length[real.c]) { : missing value where TRUE/FALSE needed
I found a couple of posts about this issue (one of which references the other), but the poster of the first did not provide much of an example to follow, and the second either pertained to an underscore that I don't have or gave LaTeX solutions. The only change that broke what already worked was the addition of the \n. I did try the TeX \\ escape, but that didn't do anything useful for text output.
I am able to get line breaks using <br> in the string for the html output file version.
This post also mentions the tex and html solutions, but not text.
Alignment on the decimal
When there are no statistical significance stars on coefficients, both the coefficients and std. errors align nicely, centered on the decimal point. However, once the stars appear, they 'push' the coefficient to the left. This happens in both the text and HTML output. It is not so bad with 1 star, but 3 stars can make quite a difference. How can I coerce it back to align on the decimal point for both formats? The issue persists even if I use the single.row=TRUE option. This post answer by @Marco Doe has a great visual of what I'm talking about, but notes that the centering is for TeX. I found a LaTeX solution, but there is no mention of the other formats on that post. I've tinkered with the align and float options to no avail (inspired by these quasi-related TeX solution posts here and here). The latter post hinted at using xtable or post-process edits, but that was more than 5 years ago, so I'm hoping for an updated, viable solution.
This image is from Marco Doe's solution and shows the LaTeX output, but it does a good job of showing the output format I get (left) and what I would like to have (right).
Reference categories
I found a LaTeX solution that 'pushes' the covariates & coefficient data down a row, making room for a reference group to be printed in the covariate column; however, the solution is in TeX. How can I replicate this for the text output? Can I replicate it for the HTML version as part of the R code without having to get surgical with the HTML output code?
@Giac posted the images (linked above) to illustrate the have (left) and want (right). Although these images are TeX, how could I get the right image output in text and HTML?

Replace text next to static text google script

I would like to replace the text in a google doc. At the moment I have place markers as follows
Invoice ##invoiceNumber##
I replace the invoice number with
body.replaceText('##invoiceNumber##',invoiceNumber);
Which is fine, but I can only run the script once, as obviously ##invoiceNumber## is no longer in the document afterwards. I was thinking I could replace the text after Invoice, as this will stay the same. appendParagraph looks like it might do the trick, but I can't figure it out. I think something like body.appendParagraph("Invoice") would select the area? I'm not sure how to append to this after that.
You could try something like this I think:
body.replaceText('InvoiceNumber \\w{1,9} ', 'InvoiceNumber ' + invoiceNumber + ' ');
I don't know how long your invoice numbers are, but that will accept from 1 to 9 word characters preceded by a space and followed by a space. The pattern might have to be modified depending on your textual needs.
Word Characters [A-Za-z0-9_]
If your invoice numbers are unique enough perhaps you could just replace them.
Reference
Regular Expression Syntax
Note: the regex pattern is passed as a string rather than a regular expression
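The idea of anchoring on the static label so the replacement can be run repeatedly can be illustrated outside Apps Script. The following is a Python sketch of the regex logic only, not of the Document Service API; `set_invoice_number` is a hypothetical helper, and it assumes the invoice number is a single whitespace-free token directly after "Invoice ".

```python
import re


def set_invoice_number(text: str, number: str) -> str:
    # Match the static label plus whatever non-space token currently follows it,
    # so both the ##invoiceNumber## marker and an earlier number are replaced
    return re.sub(r"\bInvoice \S+", lambda _m: "Invoice " + number, text)


doc = "Invoice ##invoiceNumber##\nDue: today"
first = set_invoice_number(doc, "A1001")
second = set_invoice_number(first, "A1002")
print(first)
print(second)
```

Because the label stays in the text, the second call still finds something to replace.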

How to replace a word in html only if some conditions are met with regex

I am trying to replace every occurrence of a word in a text (an HTML file), together with everything around it, looking backward until a " or a ' or a ( and looking forward until a ), with a regex in Node.js.
My problem is that when I have two words to replace, say 3.png and 13.png, the search for 3.png also matches inside 13.png, so by the time I come to replace 13.png it is no longer in the text because it was already altered by the 3.png replacement.
My ideal solution would be :
if matched pattern contains a /
then it must exact match after / and replace everything around (slash included) until we meet one of these characters (excluded) " or a ' or a ( for behind or a )
else exact match between "" or '' or ()
You can find here a regex101 example
Currently I'm sorting my words to search like so:
imgjson.sort((a, b) => b.name.length - a.name.length);
in order to replace the longest words first which solves my problem because we replace 13.png first then 3.png but I would like to know if this can be done with js regex?
Thanks a lot for your reply and time!
As @PushpeshKumarRajwanshi said, use \b (a word boundary).
If you want to test your patterns and learn more about regex, you can use https://regex101.com/.
In the bottom-right corner you can find all the special characters and functions of regex that you may need to use.
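To see what \b changes, here is a minimal sketch in Python (the same \b assertion exists in JavaScript regex). `replace_exact` is a hypothetical helper, and it covers only the exact-name match, not the "replace the whole path up to the quote" part of the question.

```python
import re


def replace_exact(text: str, name: str, replacement: str) -> str:
    # re.escape keeps the "." literal; \b stops "3.png" from matching inside "13.png"
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.sub(pattern, lambda _m: replacement, text)


html = '<img src="3.png"><img src=\'13.png\'><div style="background:url(img/3.png)">'
result = replace_exact(html, "3.png", "a.png")
print(result)
```

Note that \b treats characters such as - and . as non-word characters, so a name like my-3.png would still be matched by the 3.png pattern.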

How to generate hash from ~200k text/html that would match/compare to similar text?

I would like to make a sort of hash key out of a text (in my case html) that would match/compare to the hash of other similar text
ex of matching texts:
"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
...
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..
So far I've thought of removing numbers and tags, but that would still leave the random words.
Is there anything out there that does this?
I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.
The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.
A function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).
Any advice is welcomed!
UPDATE:
I'm referring to a hash function as it is described on Wikipedia:
A hash function is any algorithm or subroutine that maps large data
sets of variable length to smaller data sets of a fixed length.
the fixed length part is not necessary in my case.
It sounds like you need to utilize a program like diff.
If you are just trying to compare text, a hash is not the way to go, because slight differences in input cause total and complete differences in output (this is why they are used to encode passwords and secure text). Character difference programs are pretty complicated; unless you are really interested in how they work and want to write your own, I would just use a solution like the one shown here, which uses sdiff to get a percentage.
Percentage value with GNU Diff
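The point about small input changes can be seen with any cryptographic hash. This Python sketch uses SHA-256 from the standard library as an example; the two inputs are shortened versions of the sample strings in the question.

```python
import hashlib


def sha256_hex(text: str) -> str:
    # Hex digest of the UTF-8 encoded text
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest_a = sha256_hex("2012/10/01 This is my webpage #1")
digest_b = sha256_hex("2012/10/02 This is my webpage #2")

# Count the hex positions where the two digests happen to agree
matching = sum(x == y for x, y in zip(digest_a, digest_b))
print(digest_a)
print(digest_b)
print(matching, "of", len(digest_a), "hex digits match")
```

The digests carry no usable information about how similar the inputs were, which is why they cannot serve as a similarity key.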
You could use some sort of Levenshtein distance algorithm. This works for small pieces of text, but I'm fairly sure that something similar can be applied to large chunks of text.
Ref: http://en.m.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
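For reference, a minimal Python sketch of the standard dynamic-programming Levenshtein distance, plus a percentage derived from it. The `similarity_percent` formula is one common convention and an assumption here; the running time grows with the product of the two lengths, so this is not practical as-is for ~200k-character pages.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    # 100 means identical; assumed convention: 1 - distance / longer length
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))
```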
I've found out that tag order in webpages can create a very distinctive pattern, that remains the same even if portions of text / css / script change. So I've made a string generated by the tag order (ex: html head meta title body div table tr td span bold... => "hhmtbdttsb...") and then I just do exact matches between these strings. I can even apply the Levenshtein distance algorithm and get accurate results.
If I didn't have html, I would have used the punctuation/end-lines for splitting, or something similar.
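A rough Python sketch of the tag-order signature described above. The class and function names are hypothetical, only opening tags are recorded, and `difflib.SequenceMatcher.ratio()` is used as a readily available stand-in for a Levenshtein-based score, so the numbers will not match a true edit distance.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagOrderParser(HTMLParser):
    """Collect the first letter of every opening tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.letters: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.letters.append(tag[0])


def tag_signature(html: str) -> str:
    parser = TagOrderParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.letters)


def signature_similarity(sig_a: str, sig_b: str) -> float:
    # 1.0 for identical signatures, lower as the tag order diverges
    return SequenceMatcher(None, sig_a, sig_b).ratio()


template = ("<html><head><meta charset='utf-8'><title>{title}</title></head>"
            "<body><div><table><tr><td><span>{words}</span></td></tr></table></div></body></html>")
page_1 = template.format(title="2012/10/01 This is my webpage #1", words="random words 1")
page_2 = template.format(title="2012/10/02 This is my webpage #2", words="random words 2")
other = "<html><body><p>something else</p></body></html>"

print(tag_signature(page_1))
print(signature_similarity(tag_signature(page_1), tag_signature(other)))
```

Identical signatures can be found with a plain equality lookup or an index, so the pairwise score is only needed for near matches.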