I have a very large JSON Lines file with 4,000,000 rows, and I need to convert several events from every row. The resulting CSV file contains 15,000,000 rows. How can I optimize this script?
I'm using PowerShell Core 7, and the conversion takes around 50 hours to complete.
My PowerShell script:
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$output = @()
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
Remove-Item -path $Exportfile
}
foreach ($line in [System.IO.File]::ReadLines($Importfile, $encoding)) {
$json = $line | ConvertFrom-Json
foreach ($item in $json.events.items) {
$CSVLine = [pscustomobject]@{
Key = $json.Register.Key
CompanyID = $json.id
Eventtype = $item.type
Eventdate = $item.date
Eventdescription = $item.description
}
$output += $CSVLine
}
$i++
$ig++
if ($i -ge 30000) {
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
$i = 0
$output = @()
$minutes = $stopwatch.elapsed.TotalMinutes
$percentage = $ig / $totalrows * 100
$totalestimatedtime = $minutes * (100/$percentage)
$timeremaining = $totalestimatedtime - $minutes
Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
}
}
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
Here is the structure of the JSON.
{
"id": "111111111",
"name": {
"name": "Test Company GmbH",
"legalForm": "GmbH"
},
"address": {
"street": "Berlinstr.",
"postalCode": "11111",
"city": "Berlin"
},
"status": "liquidation",
"events": {
"items": [{
"type": "Liquidation",
"date": "2001-01-01",
"description": "Liquidation"
}, {
"type": "NewCompany",
"date": "2000-01-01",
"description": "Neueintragung"
}, {
"type": "ControlChange",
"date": "2002-01-01",
"description": "Tested Company GmbH"
}]
},
"relatedCompanies": {
"items": [{
"company": {
"id": "2222222",
"name": {
"name": "Test GmbH",
"legalForm": "GmbH"
},
"address": {
"city": "Berlin",
"country": "DE",
"formattedValue": "Berlin, Deutschland"
},
"status": "active"
},
"roles": [{
"date": "2002-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"demotion": true,
"group": "Control",
"dir": "Source"
}, {
"date": "2001-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"group": "Control",
"dir": "Source"
}]
}, {
"company": {
"id": "33333",
"name": {
"name": "Test2 GmbH",
"legalForm": "GmbH"
},
"address": {
"city": "Berlin",
"country": "DE",
"formattedValue": "Berlin, Deutschland"
},
"status": "active"
},
"roles": [{
"date": "2002-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"demotion": true,
"group": "Control",
"dir": "Source"
}, {
"date": "2001-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"group": "Control",
"dir": "Source"
}]
}]
}
}
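As per the comment: try to avoid using the addition assignment operator (+=) to build a collection. PowerShell arrays have a fixed size, so each += creates a new array and copies every existing element into it, which gets slower as the collection grows.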
Use the PowerShell pipeline instead, e.g.:
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
Remove-Item -path $Exportfile
}
Get-Content $Importfile -Encoding $encoding | Foreach-Object {
$json = $_ | ConvertFrom-Json
# Iterate over the events of the current row; each object emitted here flows on to Export-Csv
$json.events.items | Foreach-Object {
[pscustomobject]@{
Key = $json.Register.Key
CompanyID = $json.id
Eventtype = $_.type
Eventdate = $_.date
Eventdescription = $_.description
}
}
$i++
$ig++
if ($i -ge 30000) {
$i = 0
$minutes = $stopwatch.elapsed.TotalMinutes
$percentage = $ig / $totalrows * 100
$totalestimatedtime = $minutes * (100/$percentage)
$timeremaining = $totalestimatedtime - $minutes
Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
}
} | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
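To illustrate the point about +=, here is a minimal sketch that builds the same collection twice: once with += and once by capturing the loop output directly. The item count is an arbitrary assumption, and the timings will vary with the machine and PowerShell version, so treat it as a rough comparison rather than a benchmark.

```powershell
# Hypothetical item count, chosen only to keep the demo quick.
$count = 20000
$timer = [System.Diagnostics.Stopwatch]::StartNew()

# Pattern to avoid: each += allocates a new array and copies the old elements.
$slow = @()
foreach ($n in 1..$count) { $slow += $n }
$slowMs = $timer.ElapsedMilliseconds

# Alternative: let PowerShell collect the loop output in a single assignment.
$timer.Restart()
$fast = foreach ($n in 1..$count) { $n }
$fastMs = $timer.ElapsedMilliseconds

"+= loop : $($slow.Count) items in $slowMs ms"
"capture : $($fast.Count) items in $fastMs ms"
```

Both variables end up holding the same items; only the way the collection is built differs.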
Update 2020-05-07
Based on the comments and the extra information in the question, I have written a small reusable cmdlet that uses the PowerShell pipeline to read through the .jsonl (JSON Lines) file. It collects lines until it finds one ending with a closing '}' character, then checks whether the collected text is a valid JSON string (using Test-Json, as there might be embedded objects). If it is valid, it immediately releases the extracted object into the pipeline and starts collecting lines again:
Function ConvertFrom-JsonLines {
[CmdletBinding()][OutputType([Object[]])]Param (
[Parameter(ValueFromPipeLine = $True, Mandatory = $True)][String]$Line
)
Begin { $JsonLines = [System.Collections.Generic.List[String]]@() }
Process {
$JsonLines.Add($Line)
If ( $Line.Trim().EndsWith('}') ) {
$Json = $JsonLines -Join [Environment]::NewLine
If ( Test-Json $Json -ErrorAction SilentlyContinue ) {
$Json | ConvertFrom-Json
$JsonLines.Clear()
}
}
}
}
You can use it like this:
Get-Content .\file.jsonl | ConvertFrom-JsonLines | ForEach-Object { $_.events.items } |
Export-Csv -Path $Exportfile -NoTypeInformation -Encoding UTF8
I was able to make it ~40% faster with two small changes: 1. use Get-Content -ReadCount and unpack the buffered lines, and 2. let the pipeline 'flow' more by moving ConvertFrom-Json into the pipeline instead of using the $json = ... plus foreach part.
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$Importfile = "$psscriptroot\input2.jsonl"
$Exportfile = "$psscriptroot\output.csv"
if (Test-Path $Exportfile) {
Remove-Item -Path $Exportfile
}
# Changed the next few lines
Get-Content $Importfile -Encoding $encoding -ReadCount 10000 |
ForEach-Object {
$_
} | ConvertFrom-Json | ForEach-Object {
$json = $_
$json.events.items | ForEach-Object {
[pscustomobject]@{
Key = $json.Register.Key
CompanyID = $json.id
Eventtype = $_.type
Eventdate = $_.date
Eventdescription = $_.description
}
}
$i++
$ig++
if ($i -ge 10000) {
$i = 0
$minutes = $stopwatch.elapsed.TotalMinutes
$percentage = $ig / $totalrows * 100
$totalestimatedtime = $minutes * (100 / $percentage)
$timeremaining = $totalestimatedtime - $minutes
Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
}
} | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ';' -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
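As a small illustration of what -ReadCount does, the sketch below writes five JSON Lines records to a temporary file (the file name and records are made up for the demo) and reads them back in batches of two. Each batch arrives as one array, and emitting $_ from ForEach-Object unrolls it into single lines again before ConvertFrom-Json sees them.

```powershell
# Hypothetical temp file with five JSON Lines records, just for the demo.
$demoFile = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.jsonl'
1..5 | ForEach-Object { '{"id":' + $_ + '}' } | Set-Content -Path $demoFile

# With -ReadCount 2, each pipeline item is an array of up to two lines.
$batches = @(Get-Content -Path $demoFile -ReadCount 2)

# Emitting $_ unrolls every batch back into individual lines.
$ids = Get-Content -Path $demoFile -ReadCount 2 |
    ForEach-Object { $_ } |
    ConvertFrom-Json |
    ForEach-Object { $_.id }

"Batches read: $($batches.Count)"
"Ids parsed  : $($ids -join ', ')"

Remove-Item -Path $demoFile
```

Fewer, larger pipeline items from Get-Content is presumably where most of the gain comes from; the right batch size is worth measuring on the real file.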
Related
I have the following Json file:
{
"Id":1,
"Name":"john",
"Addresses":[
{
"Id":1,
"Street":"1st Street",
"City":"Riyadh"
},
{
"Id":2,
"Street":"2nd Street",
"City":"Dammam"
}
]
}
I want to remove the second address from the array using PowerShell.
I tried the following:
$filePath = 'C:\temp\Settings.json'
$settings = $filePath | ConvertFrom-Json
foreach($item in $settings.Addresses)
{
if($item.Id -eq 2)
{
$settings.Addresses.Remove($item)
}
}
Any ideas?
The following commented code snippet could help:
$filePath = 'C:\temp\Settings.json'
$settings = Get-Content -Path $filePath | ConvertFrom-Json
# $settings.Addresses # is an array of fixed size
# $settings.Addresses.IsFixedSize # returns True
# $settings.Addresses.Remove($item) # isn't possible; hence, let's build new array:
$NewAddresses = [System.Collections.ArrayList]::new()
foreach($item in $settings.Addresses)
{
if ($item.Id -ne 2)
{
[void]$NewAddresses.Add( $item )
}
}
# and replace old one:
$settings.Addresses = $NewAddresses
$settings | ConvertTo-Json ### | Out-File -FilePath $filePath -Force -Encoding utf8
Output: .\SO\66855002.ps1
{
"Id": 1,
"Name": "john",
"Addresses": [
{
"Id": 1,
"Street": "1st Street",
"City": "Riyadh"
}
]
}
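A shorter variant of the same idea, sketched here with the JSON inlined so it runs on its own, is to filter the array with Where-Object and assign the result back. The @() wrapper is an addition not present in the answer above; it keeps Addresses an array even when a single element remains.

```powershell
# The question's JSON, inlined so the sketch does not depend on C:\temp\Settings.json.
$json = @'
{
  "Id": 1,
  "Name": "john",
  "Addresses": [
    { "Id": 1, "Street": "1st Street", "City": "Riyadh" },
    { "Id": 2, "Street": "2nd Street", "City": "Dammam" }
  ]
}
'@
$settings = $json | ConvertFrom-Json

# Keep every address except the one with Id 2; @() preserves the array shape.
$settings.Addresses = @($settings.Addresses | Where-Object { $_.Id -ne 2 })

$result = $settings | ConvertTo-Json -Depth 3
$result
```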
I would like to add an additional key with a value to my existing JSON file. Unfortunately, I haven't been able to. Here is a short overview:
My JSON file before the PowerShell script is run:
[
{
"id": "1",
"description": [
{
"country": "Brazil"
},
{
"country": "Mexico"
}
]
},
{
"id": "2",
"description": [
{
"country": "Argentina"
}
]
}
]
This is how the JSON file should look after my PowerShell script has run:
[
{
"id": "1",
"description": [
{
"country": "Brazil",
"city": "Rio de Janeiro"
},
{
"country": "Mexico",
"city": "Mexico City"
}
]
},
{
"id": "2",
"description": [
{
"country": "Argentina",
"city": "Buenos Aires"
}
]
}
]
My PowerShell script:
function GetCity($country) {
$x = "not available"
If ( $country -eq "Brazil" ) { $x = "Rio de Janeiro" }
If ( $country -eq "Mexico" ) { $x = "Mexico City" }
If ( $country -eq "Argentina" ) { $x = "Buenos Aires" }
return $x
}
# Source the JSON content
$jsonFile = 'C:\Temp\test.json'
$jsonContent = Get-Content -Path $jsonFile
# Convert JSON to PSObjects
$jsonAsPsObjects = $jsonContent | ConvertFrom-Json
foreach ($info in $jsonAsPsObjects) {
$result = GetCity($info.description.country)
jsonContent | Add-Member -Type NoteProperty -Name "City" -Value $result
}
# Save JSON back to file
$json | ConvertTo-Json | Set-Content $jsonFile
Error:
jsonContent : The term 'jsonContent' is not recognized as the name of
a cmdlet, function, script file, or operable program. Check the
spelling of the name, or if a path was included, verify that the path
is correct and try again.
How can I solve this issue?
There are two problems:
jsonContent should be $jsonContent in statement jsonContent | Add-Member ...
You're neglecting to loop over the array elements of the description property, to each of which a city property is to be added.
I suggest streamlining your code as follows:
function Get-City {
param([string] $country)
# Use a `switch` statement:
switch ($country) {
'Brazil' { return 'Rio de Janeiro' }
'Mexico' { return 'Mexico City' }
'Argentina' { return 'Buenos Aires' }
default { return 'not available' }
}
}
$jsonFile = 'C:\Temp\test.json'
(Get-Content -Raw $jsonFile | ConvertFrom-Json) | ForEach-Object {
# Add a 'city' property to each object in the 'description' property.
$_.description.ForEach({
Add-Member -InputObject $_ city (Get-City $_.country)
})
$_ # output the modified object
} | ConvertTo-Json -Depth 3 # | Set-Content $jsonFile
I need to pull data with a particular heading from a JSON file and output it to a CSV file.
$data = (Get-Content "C:\Users\QVL6\Downloads\express-ordering-web-variables.json" | ConvertFrom-Json)
# get data
[PSCustomObject[]]$data = @(
[PSCustomObject]@{
Name = 'Name'
Type = 'Type'
Value = 'Value'
Description = 'Description'
}
)
$path = C:\Users\QVL6\
$data | Select-Object -Property Name, Type, Value, Description | Export -Csv
-Path .\data.csv -NoClobber -NoTypeInformation
Json file:
{
"Id": "variableset-Projects-174",
"OwnerId": "Projects-174",
"Version": 23,
"Variables": [
{
"Id": "dfd06d9f-5ab5-0b40-bfed-d11cd0d90e62",
"Name": "apiConfig:orderCommandUrl",
"Value": "http://dev.order-service.local",
"Description": null,
"Scope": {
"Environment": [
"Environments-63"
]
},
"IsEditable": true,
"Prompt": null,
"Type": "String",
"IsSensitive": false
},
{
"Id": "252a19a0-4650-4920-7e66-39a80c1c49ec",
"Name": "apiConfig:orderCommandUrl",
"Value": "http://qa.order-service.local",
"Description": null,
"Scope": {
"Environment": [
"Environments-63",
"Environments-64"
]
},
"IsEditable": true,
"Prompt": null,
"Type": "String",
"IsSensitive": false
},
I want to pull out all the values in the Name field.
Get-Content already returns strings, so you can pipe its output directly to ConvertFrom-Json.
$data = Get-Content "C:\Users\myFile\Downloads\express-ordering-web-variables.json" | ConvertFrom-Json
The variable $data is already an object, so you don't have to convert it again before exporting. You can directly select the needed properties and export them to a CSV file.
$data = Get-Content "C:\Users\myFile\Downloads\express-ordering-web-variables.json" | ConvertFrom-Json
#get some random data
[PSCustomObject[]]$data = @(
[PSCustomObject]@{
H1 = 'Test'
H2 = 'Test2'
},
[PSCustomObject]@{
H1 = 'Test'
H2 = 'Test2'
}
)
$data | Select-Object -Property H1, H2 | Export-Csv -Path $Path -NoClobber -NoTypeInformation
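Applied to the JSON from the question, a minimal sketch could look like the following. It assumes the entries of interest live in the top-level Variables array, uses a shortened inline copy of the file, and writes to a hypothetical path in the temp folder; adjust both to the real file.

```powershell
# Shortened inline copy of the question's JSON (assumption: the real file has the same shape).
$json = @'
{
  "Id": "variableset-Projects-174",
  "Variables": [
    { "Name": "apiConfig:orderCommandUrl", "Type": "String", "Value": "http://dev.order-service.local", "Description": null },
    { "Name": "apiConfig:orderCommandUrl", "Type": "String", "Value": "http://qa.order-service.local", "Description": null }
  ]
}
'@
$data = $json | ConvertFrom-Json

# Pick the wanted columns from every entry in the Variables array.
$rows = $data.Variables | Select-Object -Property Name, Type, Value, Description

# Hypothetical output location, used only for this demo.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'variables-demo.csv'
$rows | Export-Csv -Path $csvPath -NoTypeInformation

# Just the Name values, if that is all that is needed.
$names = $data.Variables.Name
$names
```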
I made this script in PowerShell to collect some information from the computer, and I need to export it to JSON in a specific format.
$osinfo = Get-WmiObject Win32_OperatingSystem -ErrorAction STOP |
Select-Object @{Name='computername';Expression={$_.CSName}};
Write-Host "Computer_INfo:"
$osinfo | ConvertTo-Json
$rede = Get-WmiObject -Class Win32_NetworkAdapterConfiguration -ErrorAction STOP | where-object -FilterScript {$_.IPEnabled -eq $true} | Select-Object @{Name='Description';Expression={$_.Description}},
@{Name='IP_Address';Expression={$_.IPAddress[0]}};
Write-Host "LAN_INfo:"
$rede | ConvertTo-Json
The result of this command generates this JSON
Computer_INfo:
{
"computername": "DESKTOP-PCJTTEG"
}
LAN_INfo:
[
{
"Description": "Hyper-V Virtual Ethernet Adapter",
"IP_Address": "192.168.65.241"
},
{
"Description": "Hyper-V Virtual Ethernet Adapter #2",
"IP_Address": "192.168.10.104"
}
]
I wanted it to be this way.
{Computer_Info:
[
{
"computername": "DESKTOP-PCJTTEG"
}
]
},LAN_INfo:{
[
{
"Description": "Hyper-V Virtual Ethernet Adapter",
"IP_Address": "192.168.65.241"
},
{
"Description": "Hyper-V Virtual Ethernet Adapter #2",
"IP_Address": "192.168.10.104" }
]
}
You can define the structure of your JSON by designing your PSCustomObject the way you want it. To get arrays even when there is only one element, wrap the value in the @() array constructor. When converted to JSON, it will produce your missing [].
The default depth when converting to JSON is 2, which can be raised to a maximum of 100 levels.
For your output, I raised it to avoid losing content in the final output.
Here's your code, with the rendered output you were looking for.
$osinfo = Get-WmiObject Win32_OperatingSystem -ErrorAction STOP |
Select-Object @{Name = 'computername'; Expression = { $_.CSName } };
$rede = Get-WmiObject -Class Win32_NetworkAdapterConfiguration -ErrorAction STOP | where-object -FilterScript { $_.IPEnabled -eq $true } |
Select-Object @{Name = 'Description'; Expression = { $_.Description } },
@{Name = 'IP_Address'; Expression = { $_.IPAddress[0] } };
[PSCustomObject]@{
Computer_Info = @(
$osinfo,
@{'LAN_INfo' = $rede }
)
)
} | ConvertTo-Json -Depth 10
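To show the effect of @() in isolation, here is a small sketch that replaces the WMI queries with sample values taken from the question's output, so it runs anywhere. It also assumes a slightly different layout than the code above, with Computer_Info and LAN_INfo as two sibling properties; whether that matches the intended format is a judgment call.

```powershell
# Sample values copied from the question's output; the real script would use the WMI queries.
$osinfo = [PSCustomObject]@{ computername = 'DESKTOP-PCJTTEG' }
$rede = @(
    [PSCustomObject]@{ Description = 'Hyper-V Virtual Ethernet Adapter'; IP_Address = '192.168.65.241' }
)

$report = [PSCustomObject]@{
    Computer_Info = @($osinfo)   # @() forces a JSON array even for a single element
    LAN_INfo      = @($rede)     # stays an array when only one adapter is found
}

# Depth 5 is more than this structure needs; it just avoids truncated nested objects.
$jsonText = $report | ConvertTo-Json -Depth 5
$jsonText
```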
I have a JSON file in which I need a certain value to be incremented. Here is the structure of the JSON file.
{
"Body": {
"Content": {
},
"Lines": [{
"LineNumber": "1",
"DateOfService": "10/20/2017"
},
{
"LineNumber": "1",
"DateOfService": "10/20/2017"
},
{
"LineNumber": "1",
"DateOfService": "10/20/2017"
}]
}
}
I want the "LineNumber" values to start at 1 and increase by 1 for each subsequent line. I have close to 1,000 lines.
Thanks for your assistance.
$InputFile = "path\name.json.txt"
$OutPutFile = "path\name.new.json"
$objJsonFile = Get-Content -Raw -Path $InputFile | ConvertFrom-Json
$intLine = 1
# The sample JSON nests the lines under "Body", so iterate over Body.Lines
foreach ($itmLine in $objJsonFile.Body.Lines)
{
$itmLine.LineNumber = [string]$intLine++
}
$objJsonFile | ConvertTo-Json -Depth 3 | Out-File $OutPutFile
You could try to do so using the following code.
$fileContent = Get-Content PATH\TO\FILE.json;
$json = $fileContent | ConvertFrom-Json;
$counter = 1;
ForEach($line in $json.Body.Lines) {
$line.LineNumber = $counter;
$counter++;
}
# write json back to file!?
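For the open question in the last comment, one possible way to finish is sketched below. It inlines the sample JSON, casts the counter to a string so the values keep the type used in the source file, and writes to a hypothetical file in the temp folder. The -Depth value is an assumption that comfortably covers this structure.

```powershell
# The question's sample JSON, inlined so the sketch runs without an input file.
$text = @'
{ "Body": { "Content": {}, "Lines": [
  { "LineNumber": "1", "DateOfService": "10/20/2017" },
  { "LineNumber": "1", "DateOfService": "10/20/2017" },
  { "LineNumber": "1", "DateOfService": "10/20/2017" }
] } }
'@
$json = $text | ConvertFrom-Json

$counter = 1
foreach ($line in $json.Body.Lines) {
    $line.LineNumber = [string]$counter   # keep the string type used in the source file
    $counter++
}

# Hypothetical output path; -Depth keeps the nested Lines objects from being flattened.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'lines-demo.json'
$json | ConvertTo-Json -Depth 5 | Set-Content -Path $outFile -Encoding utf8
```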