How to convert cyrillic into utf16 - json

tl;dr Is there a way to convert cyrillic stored in hashtable into UTF-16?
Like кириллица into \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
I need to import a file, parse each line into an id and a value, and then convert the result to .json; I'm now struggling to find a way to convert the value into UTF-16 escape codes.
And yes, it is needed that way
cyrillic.txt:
1 кириллица
PowerShell:
clear-host
foreach ($line in (Get-Content C:\Users\users\Downloads\cyrillic.txt)){
$nline = $line.Split(' ', 2)
$properties = @{
'id'= $nline[0] #stores "1" from file
'value'=$nline[1] #stores "кириллица" from file
}
$temp+=New-Object PSObject -Property $properties
}
$temp | ConvertTo-Json | Out-File "C:\Users\user\Downloads\data.json"
Output:
[
{
"id": "1",
"value": "кириллица"
}
]
Needed:
[
{
"id": "1",
"value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
}
]
At this point, as a newcomer to PowerShell, I have no idea how to even search for this properly.

Building on Jeroen Mostert's helpful comment, the following works robustly, assuming that the input file contains no NUL characters (which is usually a safe assumption for text files):
# Sample value pair; loop over file lines omitted for brevity.
$nline = '1 кириллица'.Split(' ', 2)
$properties = [ordered] @{
id = $nline[0]
# Insert aux. NUL characters before the 4-digit hex representations of each
# code unit, to be removed later.
value = -join ([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ })
}
# Convert to JSON, then remove the escaped representations of the aux. NUL chars.,
# resulting in proper JSON escape sequences.
# Note: ... | Out-File ... omitted.
(ConvertTo-Json @($properties)) -replace '\\u0000', '\u'
Output (pipe to ConvertFrom-Json to verify that it works):
[
{
"id": "1",
"value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
}
]
Explanation:
[uint16[]] [char[]] $nline[1] converts the [char] instances of the string stored in $nline[1] into the underlying UTF-16 code units (a .NET [char] is an unsigned 16-bit integer that holds a single UTF-16 code unit).
Note that this works even with Unicode characters that have code points above 0xFFFF, i.e. that are too large to fit into a [uint16]. Such characters outside the so-called BMP (Basic Multilingual Plane), e.g. 👍, are simply represented as pairs of UTF-16 code units, so-called surrogate pairs, which a JSON processor should recognize (ConvertFrom-Json does).
However, on Windows such chars. may not render correctly, depending on your console window's font. The safest option is to use Windows Terminal, available in the Microsoft Store.
The call to the .ForEach() array method processes each resulting code unit:
"`0{0:x4}" -f $_ uses an expandable string to create a string that starts with a NUL character ("`0"), followed by a 4-digit hex. representation (x4) of the code unit at hand, created via -f, the format operator.
This trick of temporarily replacing what should ultimately be a verbatim \u prefix with a NUL character is needed, because a verbatim \ embedded in a string value would invariably be doubled in its JSON representation, given that \ acts as the escape character in JSON.
The result is something like "<NUL>043a", which ConvertTo-Json transforms as follows, given that it must escape each NUL character as \u0000:
"\u0000043a"
The result from ConvertTo-Json can then be transformed into the desired escape sequences simply by replacing \u0000 (escaped as \\u0000 for use with the regex-based -replace operator) with \u, e.g.:
"\u0000043a" -replace '\\u0000', '\u' # -> "\u043a", i.e. к

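For illustration, the conversion step described above can be wrapped in a small helper. This is only a sketch: the function name ConvertTo-UnicodeEscape is hypothetical (it is not a built-in cmdlet), and the thumbs-up character is built from its code point (U+1F44D) to sidestep script-file encoding issues. It emits the \u sequences directly, so it suits cases where you assemble the JSON string value yourself; when going through ConvertTo-Json, the NUL placeholder trick is still needed.

```powershell
# Hypothetical helper: turns every UTF-16 code unit of a string into \uXXXX.
function ConvertTo-UnicodeEscape {
    param([Parameter(Mandatory)] [string] $Text)
    -join ([uint16[]] [char[]] $Text).ForEach({ '\u{0:x4}' -f $_ })
}

# A BMP-only string: one escape sequence per character.
$escaped = ConvertTo-UnicodeEscape 'кириллица'
$escaped

# A character outside the BMP becomes a surrogate pair (two escape sequences).
$thumbsUp = [char]::ConvertFromUtf32(0x1F44D)
$escapedThumb = ConvertTo-UnicodeEscape $thumbsUp
$escapedThumb   # -> \ud83d\udc4d

# Round trip: wrap the escapes in double quotes and let the JSON parser decode them.
$roundTrip = '"{0}"' -f $escapedThumb | ConvertFrom-Json
$roundTrip -eq $thumbsUp
```
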
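Here's a way that simply saves the string to a UTF-16BE file, reads the bytes back two at a time, and formats each pair as \uXXXX, skipping the first 2 bytes, which are the BOM (\ufeff). Each pipeline item $_ is a 2-byte array, so its elements are passed to -f individually as $_[0] and $_[1]; $_ didn't work by itself. Note that there are two UTF-16 encodings with different byte orders, big endian and little endian; big endian is used here so that the bytes appear in the same order as the hex digits of the escape sequence. The range of Cyrillic is U+0400..U+04FF. -NoNewline keeps a trailing newline out of the file. -Encoding Byte works only in Windows PowerShell; in PowerShell (Core) 7+ the equivalent parameter is -AsByteStream.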
'кириллица' | set-content utf16be.txt -encoding BigEndianUnicode -nonewline
$list = get-content utf16be.txt -Encoding Byte -readcount 2 |
% { '\u{0:x2}{1:x2}' -f $_[0],$_[1] } | select -skip 1
-join $list
\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
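To make the byte-order remark concrete, here is a small in-memory sketch that needs no file. The helper name ConvertTo-HexString is hypothetical, and the character is built from its code point so the example does not depend on the script file's encoding.

```powershell
# U+043A (Cyrillic small letter ka), built from its code point.
$text = [string] [char] 0x043A

$bigEndian    = [System.Text.Encoding]::BigEndianUnicode.GetBytes($text)
$littleEndian = [System.Text.Encoding]::Unicode.GetBytes($text)

# Hypothetical helper that formats bytes as two-digit hex values.
function ConvertTo-HexString {
    param([byte[]] $Bytes)
    ($Bytes | ForEach-Object { '{0:x2}' -f $_ }) -join ' '
}

$beHex  = ConvertTo-HexString $bigEndian      # -> 04 3a (same order as \u043a)
$leHex  = ConvertTo-HexString $littleEndian   # -> 3a 04 (bytes swapped)
# The BOM that the file-based approach has to skip:
$bomHex = ConvertTo-HexString ([System.Text.Encoding]::BigEndianUnicode.GetPreamble())  # -> fe ff
$beHex, $leHex, $bomHex
```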

There must be a simpler way of doing this, but this could work for you:
$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')){
$nline = $line.Split(' ', 2)
# output an object straight away so it gets collected in variable $temp
[PsCustomObject]@{
id = $nline[0] #stores "1" from file
value = (([system.Text.Encoding]::BigEndianUnicode.GetBytes($nline[1]) |
ForEach-Object {'{0:x2}' -f $_ }) -join '' -split '(.{4})' -ne '' |
ForEach-Object { '\u{0}' -f $_ }) -join ''
}
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'
Simpler using .ToCharArray():
$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')){
$nline = $line.Split(' ', 2)
# output an object straight away so it gets collected in variable $temp
[PsCustomObject]@{
id = $nline[0] #stores "1" from file
value = ($nline[1].ToCharArray() | ForEach-Object {'\u{0:x4}' -f [uint16]$_ }) -join ''
}
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'
Value "кириллица" will be converted to \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430

Related

& sign is converted into \u0026 through powershell

I have below code:
$getvalue= 'true&replicaSet=users-shard-0&authSource=adsfsdfin&readPreference=neasrest&maxPoolSize=50&minPoolSize=10&maxIdleTimeMS=60'
$getvalue = $getvalue -replace '&','&'
$pathToJson = 'C:\1\test.json'
$a = Get-content -Path $pathToJson | ConvertFrom-Json
$a.connectionStrings.serverstring=$getvalue
$a | ConvertTo-Json | Set-content $pathToJson -ErrorAction SilentlyContinue
I got below result:
true\u0026replicaSet=users-shard-0\u0026authSource=adsfsdfin\u0026readPreference=neasrest\u0026maxPoolSize=50\u0026minPoolSize=10\u0026maxIdleTimeMS=60
The & sign was converted into \u0026. How can I prevent this conversion?
For reference, see this question.
I need the & sign in the JSON file instead of \u0026.
Windows PowerShell's ConvertTo-Json unexpectedly serializes & to its equivalent Unicode escape sequence (\u0026); ditto for ', < and > (fortunately, this no longer happens in PowerShell (Core) 7+). While unexpected and a hindrance to readability, this isn't a problem for programmatic processing, since JSON parsers, including ConvertFrom-Json, do recognize such escape sequences:
($json = 'a & b' | ConvertTo-Json) # -> `"a \u0026 b"` (WinPS)
ConvertFrom-Json $json # -> verbatim `a & b`, i.e. successful roundtrip
If you do want to convert such escape sequences to the verbatim character they represent:
This answer to the linked question shows a robust, general string-substitution approach.
However, in your case - given that you know the specific and only Unicode sequence to replace and there seems to be no risk of false positives - you can simply use another -replace operation:
$getvalue= 'true&replicaSet=users-shard-0&authSource=adsfsdfin&readPreference=neasrest&maxPoolSize=50&minPoolSize=10&maxIdleTimeMS=60'
$getvalue = $getvalue -replace '&','&'
# Simulate reading an object from a JSON
# and update one of its properties with the string of interest.
$a = [pscustomobject] @{
connectionStrings = [pscustomobject] @{
serverstring = $getValue
}
}
# Convert the object back to JSON and translate '\\u0026' into '&'.
# ... | Set-Content omitted for brevity.
($a | ConvertTo-Json) -replace '\\u0026', '&'
Output (note how the \u0026 instances were replaced with &):
{
"connectionStrings": {
"serverstring": "true&replicaSet=users-shard-0&authSource=adsfsdfin&readPreference=neasrest&maxPoolSize=50&minPoolSize=10&maxIdleTimeMS=60"
}
}
If you need to rule out false positives (e.g., \\u0026), the more sophisticated solution from the aforementioned answer is required. Otherwise, you can cover all problematic characters (&, ', < and >) with multiple -replace operations:
# Note: Use only if false positives aren't a concern.
# Sample input string that serializes to:
# "I\u0027m \u003cfine\u003e \u0026 dandy."
($json = "I'm <fine> & dandy." | ConvertTo-Json)
# Transform the Unicode escape sequences for chars. & ' < >
# back into those chars.
$json -replace '\\u0026', '&' -replace '\\u0027', "'" -replace '\\u003c', '<' -replace '\\u003e', '>'
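If false positives are a concern, a regex with a match evaluator can skip \u sequences whose backslash is itself escaped. The following is only a sketch in the spirit of the robust approach mentioned above, not the linked answer's code; the function name Convert-JsonAmpersandEscape and the sample string are made up for illustration.

```powershell
# Hypothetical helper: unescapes only \u0026, \u0027, \u003c and \u003e,
# and only when the backslash that starts the sequence is not itself escaped.
function Convert-JsonAmpersandEscape {
    param([Parameter(Mandatory)] [string] $Json)
    # Group 1: any run of escaped backslashes (\\) directly before the sequence.
    # Group 2: the four hex digits of the escape sequence.
    $pattern = '(?<!\\)((?:\\\\)*)\\u(0026|0027|003[cC]|003[eE])'
    [regex]::Replace($Json, $pattern, {
        param($m)
        # Keep the escaped backslashes as-is and decode the hex digits.
        $m.Groups[1].Value + ([char] [Convert]::ToInt32($m.Groups[2].Value, 16))
    })
}

# The first \u0026 is a real escape sequence; the second one follows an
# escaped backslash (\\), so it is literal text and must stay untouched.
$sample = '"a \u0026 b, but keep \\u0026"'
$result = Convert-JsonAmpersandEscape $sample
$result   # -> "a & b, but keep \\u0026"
```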

Convert Json with columns and rows to csv using powershell

Please help me to convert my below json file to csv.
{
"count": 12,
"name": "Daily Ticket",
"columnNames": [
"User",
"Channel",
"Date",
"# of Closed Incidents",
"Open",
"Response",
"Remark",
"Closed"
],
"rows": [
[
"abc",
"Service Web",
"\u00272020-06-13 00:00:00\u0027",
"1",
"0",
"0",
"0",
"1"
],
[
"xyz",
"Email",
"\u00272020-06-13 00:00:00\u0027",
"21",
"1",
"0",
"10",
"7"
]
]
}
I want column names as header and rows as rows separated with comma in csv.
The expected output is like below:
User,Channel,Date,# of Closed Incidents,Open,Response,Remark,Closed
abc,Service Web,\u00272020-06-13 00:00:00\u0027,1,0,0,0,1
xyz,Email,\u00272020-06-13 00:00:00\u0027,21,1,0,10,7
I'd offer the simplest approach I know:
$jsonText = @'
{"count":12,"name":"Daily Ticket","columnNames":["User","Channel","Date","# of Closed Incidents","Open","Response","Remark","Closed"],"rows":[["abc","Service Web","\u00272020-06-13 00:00:00\u0027","1","0","0","0","1"],["xyz","Email","\u00272020-06-13 00:00:00\u0027","21","1","0","10","7"]]}
'@
$json = $jsonText | ConvertFrom-Json
$jsonCsvLines = [System.Collections.ArrayList]::new()
[void]$jsonCsvLines.Add( $json.columnNames -join ',')
foreach ( $jsonCsvRow in $json.rows ) {
[void]$jsonCsvLines.Add( $jsonCsvRow -join ',')
}
$jsonCsvLines
$jsonCsv = $jsonCsvLines |
ConvertFrom-Csv -Delimiter ',' |
ConvertTo-Csv -Delimiter ',' -NoTypeInformation
$jsonCsvNoQuotes = $jsonCsv -replace [regex]::Escape('"')
Here:
the here-string $jsonText is a compressed version of your example;
$jsonCsvLines is a simple collection of strings (no CSV processing yet);
$jsonCsv is a genuine CSV in which all fields are enclosed in double quotes, while $jsonCsvNoQuotes is a CSV in which no field is enclosed in double quotes.
So far as I can tell (and I'm happy to be corrected), the format of the PSCustomObjects resulting from ConvertFrom-Json isn't suitable for direct consumption by ConvertTo-Csv.
You need to relate the elements in each row array to the column names, i.e. create an array of objects with the correct property names and values. The way I solved this was to use the array index to associate each row array element with a column name:
$JSON = Get-Content 'C:\Temp\sample.json' | ConvertFrom-Json
$Rows =
ForEach($Row in $JSON.Rows )
{
$TmpHash = [Ordered]@{}
For($i = 0; $i -lt $Row.Length; ++$i )
{
$TmpHash.Add( $JSON.columnNames[$i], $Row[$i] )
}
[PSCustomObject]$TmpHash
}
$Rows | ConvertTo-Csv -NoTypeInformation
Obviously change the file name or whatnot.
On my workstation the results look like this:
"User","Channel","Date","# of Closed Incidents","Open","Response","Remark","Closed"
"abc","Service Web","'2020-06-13 00:00:00'","1","0","0","0","1"
"xyz","Email","'2020-06-13 00:00:00'","21","1","0","10","7"
There are definitely different code patterns that can be employed here, but the theme should work.
One important difference: the source JSON contains the Unicode escape sequence for an apostrophe (\u0027), which is properly interpreted as ' in my output. I'm just pointing it out because it's one thing that differs from your sample.
I think this is pretty close to what you needed. Let me know if anything. Thanks.
Sorry to keep this going, but I'm intrigued by JosefZ's answer . This is not to override or critique, it's just for conversation's sake.
Considering ConvertFrom-Json returns PSCustomObjects my first answer went directly to using the same as input for ConvertTo-Csv. Even though creating Csv strings is common in the field it didn't occur to me at the time. I also didn't notice your sample output was unquoted, apologies for that.
At any rate here's my interpretation of JosefZ's answer
Note: In any of these samples -join would work just as well. I'm just
invoking static methods here; I don't know of any advantages or
disadvantages between -join and [String]::Join().
$JSON = Get-Content 'C:\Temp\sample.json' | ConvertFrom-Json
$Lines = .{
[String]::Join( ',', $JSON.columnNames )
$JSON.rows | ForEach-Object{ [String]::Join( ',', $_ ) }
}
# $Lines is already in unquoted CSV format. If you want to quote it
$Lines | ConvertFrom-Csv | ConvertTo-Csv -NoTypeInformation
If you only need the quoted or the unquoted form, you can drop the $Lines assignment and pipe all the way through.
Quoted:
$JSON = Get-Content 'C:\Temp\sample.json' | ConvertFrom-Json
.{
[String]::Join( ',', $JSON.columnNames )
$JSON.rows | ForEach-Object{ [String]::Join( ',', $_ ) }
} | ConvertFrom-Csv | ConvertTo-Csv -NoTypeInformation
Un-Quoted:
$JSON = Get-Content 'C:\Temp\sample.json' | ConvertFrom-Json
.{
[String]::Join( ',', $JSON.columnNames )
$JSON.rows | ForEach-Object{ [String]::Join( ',', $_ ) }
}
This approach is more concise, as it can be slimmed down to only a few lines. Concise doesn't always mean better or more readable, so again this isn't to override any other approach.
Of course, I don't know what you intend to do with the data after it's properly formatted. If you need to write to a file a simple | Out-File... can be added to the end of any of the above.
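One difference between the two approaches is worth a quick sketch: joining values with a plain comma does not protect a value that itself contains a comma, whereas ConvertTo-Csv quotes each field. The sample data below (the Note column and its values) is made up for illustration and is not from the question.

```powershell
# Made-up sample in the same shape as the question's JSON.
$json = '{"columnNames":["User","Note"],"rows":[["abc","hello, world"],["xyz","ok"]]}' |
    ConvertFrom-Json

# Plain -join: the embedded comma silently turns two fields into three.
$joined = foreach ($row in $json.rows) { $row -join ',' }
$joined[0]   # -> abc,hello, world

# Object-based approach: map each row onto the column names by index,
# then let ConvertTo-Csv take care of the quoting.
$objects = foreach ($row in $json.rows) {
    $props = [ordered] @{}
    for ($i = 0; $i -lt $row.Count; $i++) {
        $props[$json.columnNames[$i]] = $row[$i]
    }
    [pscustomobject] $props
}
$csv = $objects | ConvertTo-Csv -NoTypeInformation
$csv[1]      # -> "abc","hello, world"
```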

What is the most efficient way to replace all \ with \\, within a huge JSON File?

I have to replace all occurrences of \ with \\ within a huge JSON Lines File. I wanted to use Powershell, but there might be other options too.
The source file is 4.000.000 lines and is about 6GB.
The Powershell script I was using took too much time, I let it run for 2 hours and it wasn't done yet. A performance of half an hour would be acceptable.
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file2.jsonl"
(Get-Content -Path $Importfile) -replace "[\\]", "\\" | Set-Content -Path $Exportfile
If the replacement is simply a conversion of a single backslash to a double backslash, the file can be processed row by row.
Using a StringBuilder puts data into a memory buffer, which is flushed to disk every now and then. Like so,
$src = "c:\path\MyBigFile.json"
$dst = "c:\path\MyOtherFile.json"
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText($src)
$i = 0
$MaxRows = 10000
while($null -ne ($line = $reader.ReadLine())) {
# Replace slashes
$line = $line.replace('\', '\\')
# ' markdown coloring is confused by backslash-apostrophe
# so here is an extra one just for looks
[void]$sb.AppendLine($line)
++$i
# Write builder contents into file every now and then
if($i -ge $MaxRows) {
add-content $dst $sb.ToString() -NoNewline
[void]$sb.Clear()
$i = 0
}
}
# Flush the builder after the while loop if there's data
if($sb.Length -gt 0) {
add-content $dst $sb.ToString() -NoNewline
}
$reader.close()
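As a variation on the same row-by-row idea, a StreamWriter can take over the buffering, so no StringBuilder or repeated Add-Content calls are needed. This is a sketch, not a benchmarked solution: the function name Copy-FileWithDoubledBackslash is hypothetical, the UTF-8 (no BOM) output encoding is an assumption, and the temp files only serve as a demonstration.

```powershell
# Hypothetical helper; expects full paths, because .NET resolves relative
# paths against the process working directory, not PowerShell's location.
function Copy-FileWithDoubledBackslash {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination
    )
    $reader = [IO.File]::OpenText($Source)
    $writer = $null
    try {
        # Assumption: UTF-8 without BOM; pick the encoding your consumer expects.
        $writer = [IO.StreamWriter]::new($Destination, $false, [Text.UTF8Encoding]::new($false))
        while ($null -ne ($line = $reader.ReadLine())) {
            # Literal (non-regex) replacement of each \ with \\
            $writer.WriteLine($line.Replace('\', '\\'))
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}

# Small demonstration with temporary files.
$src = [IO.Path]::GetTempFileName()
$dst = [IO.Path]::GetTempFileName()
Set-Content -Path $src -Value '{"path":"C:\temp\file"}'
Copy-FileWithDoubledBackslash -Source $src -Destination $dst
$converted = Get-Content -Path $dst
$converted   # -> {"path":"C:\\temp\\file"}
```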
Use the -ReadCount parameter of the Get-Content cmdlet (and set it to 0).
-ReadCount
Specifies how many lines of content are sent through the pipeline at a
time. The default value is 1. A value of 0 (zero) sends all of the
content at one time.
This parameter does not change the content displayed, but it does
affect the time it takes to display the content. As the value of
ReadCount increases, the time it takes to return the first line
increases, but the total time for the operation decreases. This can
make a perceptible difference in large items.
Example (runs approx. 17× faster for a file of approx. 20 MB):
$file = 'D:\bat\files\FileTreeLista.txt'
(Measure-Command {
$xType = (Get-Content -Path $file ) -replace "[\\]", "\\"
}).TotalSeconds, $xType.Count -join ', '
(Measure-Command {
$yType = (Get-Content -Path $file -ReadCount 0) -replace "[\\]", "\\"
}).TotalSeconds, $yType.Count -join ', '
Get-Item $file | Select-Object FullName, Length
13,3288848, 338070
0,7557814, 338070
FullName Length
-------- ------
D:\bat\files\FileTreeLista.txt 20723656
Based on your earlier question, How can I optimize this Powershell script, converting JSON to CSV?, you should try to use the PowerShell pipeline for this, especially since it concerns large input and output files.
The point is that you shouldn't judge performance by timing single parts of the solution, because that usually gives the wrong impression: a complete (PowerShell) pipeline solution is supposed to perform better than the sum of its parts. Besides, it saves a lot of memory and results in lean PowerShell syntax.
In your specific case, if set up correctly, the CPU will be replacing the backslashes, rebuilding the JSON strings and converting them to objects while the hard disk is busy reading and writing the data.
To build the backslash replacement into the PowerShell pipeline together with ConvertFrom-JsonLines (a custom function from that earlier question, not a built-in cmdlet):
Get-Content .\file.jsonl | ForEach-Object { $_.replace('\', '\\') } |
ConvertFrom-JsonLines | ForEach-Object { $_.events.items } |
Export-Csv -Path $Exportfile -NoTypeInformation -Encoding UTF8

How can I organize this location data (json output) in a text file using PowerShell?

C:\temp\GeoDATA.txt:39:Content : {"ip":"68.55.28.227","city":"Plymouth","region_code":"MI","zip":"48170"}
C:\temp\GeoDATA.txt:56:Content : {"ip":"72.95.198.227","city":"Homestead","region_code":"PA","zip":"15120"}
C:\temp\GeoDATA.txt:73:Content : {"ip":"68.180.94.219","city":"Normal","region_code":"IL","zip":"61761"}
C:\temp\GeoDATA.txt:90:Content : {"ip":"75.132.165.245","city":"Belleville","region_code":"IL","zip":"62226"}
C:\temp\GeoDATA.txt:107:Content : {"ip":"97.92.20.220","city":"Farmington","region_code":"MN","zip":"55024"}
Each line starts with the path and ends with the closing }
I would like to organize this as a table with the headers being "ip, city, region_code, zip" and the appropriate data below each header. Something like this...
ip city region_code zip
68.55.28.227 Plymouth MI 48170
72.95.198.227 Homestead PA 15120
68.180.94.219 Normal IL 61761
75.132.165.245 Belleville IL 62226
97.92.20.220 Farmington MN 55024
These are the first 5 lines of a text file with hundreds more, so please keep that in mind.
Assuming that file input.txt contains data like your sample input data, the following should work:
(Get-Content input.txt) -replace '.*: (?=\{)' | ConvertFrom-Json
-replace '.*: (?=\{)' strips the prefix from each input line using a regular expression, returning only the JSON part:
.*:  matches any sequence of characters followed by : and a space.
(?=\{) is a lookahead assertion ((?=...)) that matches a single { (escaped as \{, because { has special meaning in regexes) without including it in the match.
Since lookaround assertions aren't considered part of the substring matched by the regex, each line is only matched up to the space before the { that starts the JSON part, and by replacing the matching part with the empty string (implicitly, because no replacement string is given), it is effectively removed from each line, leaving just the JSON part.
Piping the result to ConvertFrom-Json yields a collection of custom objects whose properties reflect the JSON input, yielding the desired tabular output by default.
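To see the two steps in isolation, here is the same technique applied to a single line taken from the sample input; the variable names are only illustrative.

```powershell
# One line from the sample input.
$line = 'C:\temp\GeoDATA.txt:39:Content : {"ip":"68.55.28.227","city":"Plymouth","region_code":"MI","zip":"48170"}'

# Step 1: strip everything up to (but not including) the opening {.
$jsonPart = $line -replace '.*: (?=\{)'
$jsonPart   # -> {"ip":"68.55.28.227","city":"Plymouth","region_code":"MI","zip":"48170"}

# Step 2: parse the remaining JSON into an object with ip, city, region_code and zip properties.
$geo = $jsonPart | ConvertFrom-Json
$geo.city   # -> Plymouth
```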
Assuming the data is in the test.txt file.
Try this:
$Data = $null
$Table = @()
$Data = Get-Content C:\Users\lt\AppData\Local\Temp\test.txt
$Data | %{
$IP = (($_ -split "{")[1] -split "," -split ":")[1] -replace "`"",""
$City = (($_ -split "{")[1] -split "," -split ":")[3] -replace "`"",""
$Region_Code = (($_ -split "{")[1] -split "," -split ":")[5] -replace "`"",""
$ZIP = (($_ -split "{")[1] -split "," -split ":")[7] -replace "}","" -replace "`"",""
$Table += "$IP,$City,$Region_Code,$ZIP"
}
ConvertFrom-Csv -Header "IP","City","Region_Code","ZIP" -InputObject $Table
Please let me know if this helps and don't forget to mark it as answer :).

PowerShell: Get 2 strings into a hashtable and out to .csv

PowerShell newbie here,
I need to:
1. Get text files in recursive local directories that have a common string, students.txt, in them.
2. Get another string, gc.student="name,name", in the resulting file set from #1 and get the name(s).
3. Put the filename from #1, and just the name,name from #2 (not gc.student=""), into a hashtable where the filename is paired with its corresponding name,name.
4. Output the hashtable to an Excel spreadsheet with 2 columns: File and Name.
I've figured out, having searched and learned here and elsewhere, how to output #1 to the screen, but not how to put it into a hashtable with #2:
$documentsfolder = "C:\records\"
foreach ($file in Get-ChildItem $documentsfolder -recurse | Select-String -pattern "students.txt" ) {$file}
I'm thinking to get name in #2 I'll need to use a RegEx since there might only be 1 name sometimes.
And for the output to Excel, this: | Export-Csv -NoType output.csv
Any help moving me on is appreciated.
I think this should get you started. The explanations are in the code comments.
# base directory
$documentsfolder = 'C:\records\'
# get files with names ending with students.txt
$files = Get-ChildItem $documentsfolder -recurse | Where-Object {$_.Name -like "*students.txt"}
# process each of the files
# (a foreach statement cannot be piped directly, so collect its output in a variable first)
$results = foreach ($file in $files)
{
$fileContents = Get-Content $file.FullName
$fileName = $file.Name
#series of matches to clean up different parts of the content
#first find the gc.... pattern
$fileContents = ($fileContents | Select-String -Pattern 'gc.student=".*"').Matches.Value
# then select the string with double quotes
$fileContents = ($fileContents | Select-String '".*"').Matches.Value
# then remove the leading and trailing double quotes
$fileContents = $fileContents -replace '^"','' -replace '"$',''
# output the object implicitly (no Write-Output needed) so that it can be piped to Export-Csv
# I am creating custom objects so that your CSV headers will have nice column names
[pscustomobject]@{file=$fileName;name=$fileContents}
}
$results | Export-Csv -NoType output.csv
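As an alternative to the three cleanup steps, a single regex with a capture group can pull out just the names. This is a sketch on a made-up sample line (the names Smith and Jones are placeholders), and it assumes the value never contains a double quote.

```powershell
# Made-up sample content; a real file line would come from Get-Content.
$sampleLine = 'gc.student="Smith,Jones"'

# \. matches a literal dot; ([^"]*) captures everything between the quotes.
$names = if ($sampleLine -match 'gc\.student="([^"]*)"') { $Matches[1] }
$names   # -> Smith,Jones

# The same pattern also copes with a single name.
$single = if ('gc.student="Smith"' -match 'gc\.student="([^"]*)"') { $Matches[1] }
$single  # -> Smith
```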