I have an input XML file which contains the normal HTML names for various characters, e.g. Double Quote = &quot; etc.
<Notes>Double Quote &quot; Single Quote &apos; Ampersand &amp;</Notes>
Before
<?xml version="1.0" encoding="UTF-8"?>
<OrganisationUnits>
<OrganisationUnitsRow num="8">
<OrganisationId>ACME24/7HOME</OrganisationId>
<OrganisationName>ACME LTD</OrganisationName>
<Notes>Double Quote &quot; Single Quote &apos; Ampersand &amp; </Notes>
<Sector>P</Sector>
<SectorDesc>Private Private &amp; Voluntary</SectorDesc>
</OrganisationUnitsRow>
</OrganisationUnits>
After
<?xml version="1.0" encoding="UTF-8"?>
<OrganisationUnits>
<OrganisationUnitsRow num="8">
<OrganisationId>ACME24/7HOME</OrganisationId>
<OrganisationName>ACME LTD</OrganisationName>
<Notes>Double Quote " Single Quote ' Ampersand &amp;</Notes>
<Sector>P</Sector>
<SectorDesc>Private Private &amp; Voluntary</SectorDesc>
</OrganisationUnitsRow>
</OrganisationUnits>
I am treating the file as XML and it gets processed OK, nothing very fancy.
$xml = [xml](Get-Content $path\$File)
foreach ($CMCAddressesRow in $xml.OrganisationUnits.OrganisationUnitsRow) {
blah
blah
}
$xml.Save("$path\$File")
When the output is saved, all the HTML codes like &quot; get replaced by ".
How can I retain the original HTML &quot; characters? And, more importantly, why is it happening?
What you're referring to is called "character entities". PowerShell converts them on import, so you can work with the actual characters these entities represent, and converts on export only what must be encoded in the XML file. Quotation characters don't need to be encoded in a node value, so they're not being encoded on export.
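To see the same decode-on-load, encode-only-what-is-needed behaviour outside PowerShell, here is a minimal sketch using Python's standard xml.etree.ElementTree. It only illustrates how XML parsers generally treat entities; it is not the .NET implementation that PowerShell uses, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

source = "<Notes>Double Quote &quot; Single Quote &apos; Ampersand &amp;</Notes>"

# Parsing decodes every entity into the character it stands for.
node = ET.fromstring(source)
decoded = node.text

# Serializing re-encodes only what XML requires inside element text,
# so the ampersand is escaped again but the quotes are written as-is.
saved = ET.tostring(node, encoding="unicode")

print(decoded)
print(saved)
```

Both documents are equivalent as far as an XML parser is concerned, which is why the serializer does not bother to restore the quote entities.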
Related
I have a file that contains HTML code in which the tags are encoded like this:
\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e
The decoded HTML should be:
<div data-name="region-name" class="main-id">UK</div>
In Ruby, I used the cgi library's unescapeHTML, but it does not work: when it reads the content, it does not recognize the encoded tags. Here is an example:
require 'cgi'
single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
double_quoted_string = "\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e"
puts 'unescape single_quoted_string ' + CGI.unescapeHTML(single_quoted_string)
puts 'unescape double_quoted_string ' + CGI.unescapeHTML(double_quoted_string)
The output of the previous code is:
unescape single_quoted_string \x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e
unescape double_quoted_string <div data-name="region-name" class="main-id">UK</div>
My question is, how can I make the single_quoted_string act as if its content is double-quoted to make the function understand the encoded tags?
Thanks
Ruby's parser allows certain escape sequences in string literals.
The double-quoted string literal "\x3c" is recognized as containing a hexadecimal pattern \xnn which represents the single character <. (0x3C in ASCII)
The single-quoted string literal '\x3c' however is treated literally, i.e. it represents four characters: \, x, 3, and c.
how can I make the single_quoted_string act as if its content is double-quoted
You can't. In order to turn these four characters into < you have to parse the string yourself:
str = '\x3c'
str[2, 2]          #=> "3c"  (take hex part)
str[2, 2].hex      #=> 60    (convert to number)
str[2, 2].hex.chr  #=> "<"   (convert to character)
You can apply this to gsub:
str = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
str.gsub(/\\x\h{2}/) { |m| m[2, 2].hex.chr }
#=> "<div data-name=\"region-name\" class=\"main-id\">UK</div>"
/\\x\h{2}/ matches a literal backslash (\\) followed by x and two ({2}) hex characters (\h).
Just for reference, a CGI encoded string would look like this:
str = "<div data-name=\"region-name\" class=\"main-id\">UK</div>"
CGI.escapeHTML(str)
#=> "&lt;div data-name=&quot;region-name&quot; class=&quot;main-id&quot;&gt;UK&lt;/div&gt;"
It uses &...; style character references.
Your problem has nothing to do with HTML; \x3c represents the hex number 3c, which is < in the ASCII table.
Double-quoted strings look for these patterns and convert them to the corresponding characters, whereas single-quoted strings keep the text exactly as written.
You can check for yourself that CGI is not doing anything.
CGI.unescapeHTML(double_quoted_string) == double_quoted_string
The easiest way I know to solve your problem is through gsub
def convert(str)
  # \h matches a hex digit, so only valid \xNN sequences are converted
  str.gsub(/\\x(\h\h)/) do
    [Regexp.last_match(1)].pack("H*")
  end
end
single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
puts convert(single_quoted_string)
What convert does is to get every pair of hex escaped values and pack them as characters.
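For comparison, the same parse-it-yourself idea can be sketched in Python, where a raw string literal (r'...') plays the role of Ruby's single-quoted string. This is only an illustration; decode_hex_escapes is a hypothetical helper name, not part of any library.

```python
import re


def decode_hex_escapes(text: str) -> str:
    """Replace each literal \\xNN sequence with the character it encodes."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",                # backslash, x, exactly two hex digits
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


# Raw string: the backslashes are kept as ordinary characters.
raw = r'\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
print(decode_hex_escapes(raw))
```

Sequences that are not valid hex, such as \xzz, are left untouched because the pattern only matches two hex digits.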
I need to render a string exactly as I get it from the server, for example if I get a string that contains "\t" I need it to be rendered as "\t" and not as space/s.
In the state of the component I see that the string appears with the special characters but rendered without:
state:
'\"id\"\t\"name\n key\"'
what is rendered:
"id" "name key"
How can I prevent this from happening?
JavaScript interprets escape sequences such as \n in string literals, and the DOM then renders the resulting control characters as whitespace. To show them literally, pick the special characters you care about and replace each one with its escaped form (an extra backslash in front):
Take a look at this runnable snippet:
let textWithSpecialChars = `"id"\t"name\n key\f \r \b"`;
const specialChars = ['\\b', '\\r', '\\f', '\\n', '\\t'];
let modifiedTextWithSpecialChars = JSON.stringify(textWithSpecialChars);
specialChars.forEach((char) => {
  // replaceAll (not replace) so that every occurrence is escaped, not just the first
  modifiedTextWithSpecialChars = modifiedTextWithSpecialChars.replaceAll(char, '\\' + char);
});
console.log(modifiedTextWithSpecialChars);
// "\"id\"\\t\"name\\n key\\f \\r \\b\""
console.log(JSON.parse(modifiedTextWithSpecialChars));
// "id"\t"name\n key\f \r \b"
console.log(textWithSpecialChars);
// "id" "name
// key
// "
document.body.innerHTML = JSON.parse(modifiedTextWithSpecialChars)
Analysis
Collect all the special characters you want to handle in that array.
Stringify your string; this way the special characters become visible escape sequences (a newline becomes the two characters \ and n).
Loop through all the special characters that you added above.
For every special character, replace the escape sequence, such as \n, with \\n in the stringified text. Assign the result back to the modified string and continue until all special characters are replaced. The stringified result will be '\"id\"\\t\"name\\n key\\f \\r \\b\"'.
Parse the string back, and JS will no longer treat the special characters as special; rather, they will be plain text.
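The core of the steps above is a mapping from each control character to its visible two-character form. A minimal sketch of that mapping in Python, for illustration only (VISIBLE_ESCAPES and show_escapes are hypothetical names, and the list of characters is the same one used in the snippet above):

```python
# Map each control character to the backslash sequence that should be displayed.
VISIBLE_ESCAPES = str.maketrans({
    "\b": r"\b",
    "\f": r"\f",
    "\n": r"\n",
    "\r": r"\r",
    "\t": r"\t",
})


def show_escapes(text: str) -> str:
    """Return text with control characters replaced by visible escapes."""
    return text.translate(VISIBLE_ESCAPES)


state = '"id"\t"name\n key"'
print(show_escapes(state))
```

Because the translation happens in one pass over the original characters, there is no risk of escaping a backslash twice.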
I have a csv similar to this (original file is proprietary, cannot share). Separator is Tab.
It contains a description column whose text is wrapped in double quotes and can itself contain quoted strings, where, wait for it, the escape character is also a double quote.
id description other_field
12 "Some Description" 34
56 "Some
Multiline
""With Escaped Stuff""
Description" 78
I am parsing the file with this code
let mut reader = csv::ReaderBuilder::new()
.from_reader(file)
.deserialize().unwrap();
I'm consistently getting CSV deserialize error :
CSV deserialize error: record 43747 (line: 43748, byte: 21082563): missing field 'id'
I tried using flexible(true) and double_quote(true), with no luck.
Is it possible to parse this type of field, and if yes, how ?
Actually the issue was unrelated: the csv crate parses this kind of field without trouble. I had just forgotten to set the delimiter (tab in this case). This code works:
let mut reader = csv::ReaderBuilder::new()
.delimiter(b'\t')
.from_reader(file)
.deserialize().unwrap();
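The same point, that a standard CSV parser copes with multi-line fields and doubled quotes once the delimiter is right, can be checked with a small sketch using Python's built-in csv module. The data below is the sample from the question with tabs restored; this illustrates the CSV dialect and is not the Rust csv crate's API.

```python
import csv
import io

data = (
    "id\tdescription\tother_field\n"
    '12\t"Some Description"\t34\n'
    '56\t"Some\nMultiline\n""With Escaped Stuff""\nDescription"\t78\n'
)

# newline="" leaves line endings alone so the csv module can handle
# newlines embedded in quoted fields itself.
reader = csv.DictReader(io.StringIO(data, newline=""), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["id"], repr(row["description"]), row["other_field"])
```

Without delimiter="\t" the parser would look for commas, treat each line as a single column, and the named fields would not line up, which is the kind of failure reported in the question.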
I'm getting the following format for my XML data:
<?xml version="1.0" encoding="UTF-8"?>
<n0:MT_NODE_CodingList xmlns:n0="cdcvvrvrv">
<DocumentId>78D6590F-2843-434D-AF0F-76B11680B6AD</DocumentId>
<CodingLines> <CurrentDocumentLineReferenceId>00001</CurrentDocumentLineReferenceId>
<LineID>00001</LineID>
<UUID>3CA6E835-1F8F-4B7F-A255-FCBF766AE1C8</UUID>
<Amount>7000000.00</Amount>
<currencyID>USD</currencyID>
<Quantity>100000.000</Quantity>
<unitCode>KGM</unitCode>
<Codes> <ID>purchasinggroup</ID>
<Name>Reserve for Source</Name>
<Value>530</Value>
</Codes> </CodingLines> </n0:MT_NODE_CodingList>
I would like to convert the XML to the format below:
{"DocumentId":"41DCF8A4-6D05-4A8F-9265-F5E6BCD96CCF",
"CodingLines":{{"CodingLines":[{"ID":{"value":"177AFD35-5EF5-4466-88C6-B4755CC2E1A0"},
"OrderLineReference":[{"LineID":{"value":"00001"}}],
"CurrentDocumentLineReferenceId":"00001",
"Amount":{"value":"100.00 ",
"currencyID":"USD"},
"Quantity":{"value":"100.000 ",
"unitCode":"GIA"},"Codes":[{"ID":{"value":"purchasinggroup"},
"Name":{"value":"Reserve for Source"},
"Value":{"value":108}}]}]}}}
But when I do the conversion from the SAP application, I get JSON in the format below:
{"DocumentId":"41DCF8A4-6D05-4A8F-9265-F5E6BCD96CCF",
"CodingLines":[{"CodingLines":[{"ID":{"value":"177AFD35-5EF5-4466-88C6-B4755CC2E1A0"},
"OrderLineReference":[{"LineID":{"value":"00001"}}],
"CurrentDocumentLineReferenceId":"00001",
"Amount":{"value":"100.00 ",
"currencyID":"USD"},
"Quantity":{"value":"100.000 ",
"unitCode":"GIA"},
"Codes":[{"ID":{"value":"purchasinggroup"},
"Name":{"value":"Reserve for Source"},
"Value":{"value":108}}]}]}]}
What do I have to do to make the first CodingLines start with '{' and the CodingLines one level below be an array ('[')?
I have XML structured similarly to the example below, and I've written an XQuery in MarkLogic to export it to a CSV (see below the XML).
What I need help with is formatting the output so that when I open the CSV file, instead of having all of the output on one line, it is grouped into columns, so to speak.
For the sample below, I'd like to output all of the DateTime and Source element values and have the values in their own columns like this:
2012-02-15T00:58:26 a
2012-02-15T00:58:26 b
2012-02-15T00:58:26 c
How would I go about that?
Would welcome any reference points or help. Thank you in advance.
Here's the sample XML:
<Document xmlns="http://fakeexample.org/schemas">
<Information>
<ItemId>1f28cb0c2c4f4eb7b13c4abf998e391e</ItemId>
<MediaType>Text</MediaType>
<DocDateTime>2012-02-15T00:58:26</DocDateTime>
</Information>
<FilingData>
<DateTime>2012-02-15T00:58:26</DateTime>
<Source>a</Source>
</FilingData>
<FilingData>
<DateTime>2012-02-15T00:58:27</DateTime>
<Source>b</Source>
</FilingData>
<FilingData>
<DateTime>2012-02-15T00:58:28</DateTime>
<Source>c</Source>
</FilingData>
</Document>
Here's the sample XQuery:
xquery version "1.0-ml";
declare default function namespace "http://www.w3.org/2005/xpath-functions";
declare namespace xdmp="http://marklogic.com/xdmp";
declare namespace exam="http://fakeexample.org/schemas";
declare function local:getDocument($url)
{
let $response := xdmp:document-get($url,
<options xmlns="xdmp:document-get">
<repair>full</repair>
<format>xml</format>
</options>)
return $response
};
xdmp:set-response-content-type("text/csv"),
xdmp:add-response-header(
"Content-disposition",
fn:concat("attachment;filename=", "output", fn:current-time(), ".csv")
),
(
let $q := cts:element-value-query(xs:QName("exam:ItemId"), ("1f28cb0c2c4f4eb7b13c4abf998e391e"))
let $results := cts:search(fn:doc(), $q)
for $result in $results
return fn:string-join((xs:string($result//exam:DateTime),
xs:string($result//exam:Source)
), "," )
)
Replace your for loop with this:
return
  string-join(
    for $result in $results//exam:FilingData
    return fn:string-join((xs:string($result//exam:DateTime),
                           xs:string($result//exam:Source)
                          ), "," )
  , "&#10;")
That should about do the trick.
Edit: note that I added //exam:FilingData behind $results (with the exam: prefix, since the elements are in that namespace). That makes sure the DateTime and Source of each FilingData are joined separately and returned as separate strings by the for loop. That allows the outer string-join to add the required line ends between them.
Note: &#10; should be translated to OS specific line endings automatically.
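The shape of that fix is an inner join per record (comma between the fields) wrapped in an outer join across records (newline between the rows). A minimal sketch of the same structure in Python, using the sample document from the question; it illustrates the nesting only and is not MarkLogic code:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Document xmlns="http://fakeexample.org/schemas">
  <FilingData><DateTime>2012-02-15T00:58:26</DateTime><Source>a</Source></FilingData>
  <FilingData><DateTime>2012-02-15T00:58:27</DateTime><Source>b</Source></FilingData>
  <FilingData><DateTime>2012-02-15T00:58:28</DateTime><Source>c</Source></FilingData>
</Document>"""

NS = {"exam": "http://fakeexample.org/schemas"}
root = ET.fromstring(SAMPLE)

# Inner join: one comma-separated line per FilingData element.
lines = [
    ",".join((
        filing.findtext("exam:DateTime", namespaces=NS),
        filing.findtext("exam:Source", namespaces=NS),
    ))
    for filing in root.iterfind("exam:FilingData", NS)
]

# Outer join: a line feed between the rows.
csv_text = "\n".join(lines)
print(csv_text)
```

Iterating over the repeating element, rather than over the whole document, is what keeps each DateTime paired with its own Source.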
Building on the answer by @grtjn:
string-join(..., "&#10;")
Line endings can be treated differently depending on OS or application. You could try alternative characters (either or both):
"&#10;" (LF)
"&#13;" (CR)
Also, this could be thwarted by the application used to view the CSV. For example, most versions of Microsoft Excel will convert all whitespace within a cell - newlines included - into plain spaces.