How to remove empty cells from a CSV while parsing with PapaParse?

Or to put the question another way: why is PapaParse's ParseResult.data an empty array when trimming all leading and trailing empty cells inside the step callback? EDIT: Please note I can achieve what I want by mapping over the parsed results and trimming, but I don't want to parse and then map; I'd rather do it all in one pass.
Example CSV:
Col 1,Col 2,Col 3
1-1,1-2,
,2-2,2-3
3-1,3-2,3-3
Note that row 1 contains headers (Col 1, Col 2, etc). Row 2 col 3 is empty, and
row 3 col 1 is empty.
Given that CSV, I want to present this back to the user (as a nicely-formatted
table):
| | | |
|-----|-----|-----|
| 1-1 | 1-2 | |
| 2-2 | 2-3 | |
| 3-1 | 3-2 | 3-3 |
I want to push all rows as far to the left as they can go, and remove all empty
cells from the end of each row.
In other words, I want to trim all empty cells from both the beginning and the
end of each row. Below is the code I'm using. I have put debugger statements inside
trimEmptyCells and it does exactly what I expect. However, the ParseResult
that parseAndTrim returns contains an empty data array.
export const parseAndTrim = (csv: string): Papa.ParseResult => {
  return Papa.parse(csv, {
    skipEmptyLines: true,
    step: trimEmptyCells,
  });
};

const trimEmptyCells = (results: Papa.ParseResult) => {
  // `_.dropWhile` and `_.dropRightWhile` are lodash functions:
  // https://lodash.com/docs/4.17.15#dropRight
  const leftTrimmed = _.dropWhile(results.data, (r) => r === "");
  return _.dropRightWhile(leftTrimmed, (r) => r === "");
};
My first guess was that PapaParse was choking on rows of different lengths, but
the errors array is also empty. So I tested what I could (without a step function)
at https://www.papaparse.com/demo: the example below, which has missing cells
(not merely empty ones), throws no errors and returns a proper data array.
Example test input at https://www.papaparse.com/demo
Col 1,Col 2,Col 3
1-1,1-2
,2-2,2-3
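
As it turns out, the empty data array appears to be expected behavior: when a step callback is supplied, PapaParse streams the input and does not accumulate rows, so whatever trimEmptyCells returns is simply discarded. One way to still trim in a single pass is to collect the trimmed rows yourself inside step. A minimal sketch, assuming PapaParse 5.x (where results.data inside step holds the current row) and lodash imported as _; parseAndTrimStreaming is just an illustrative name:

// Sketch, not the original code: accumulate trimmed rows in a closure,
// since the value returned from `step` is discarded.
export const parseAndTrimStreaming = (csv: string): string[][] => {
  const rows: string[][] = [];
  Papa.parse<string[]>(csv, {
    skipEmptyLines: true,
    step: (results) => {
      // While streaming, `results.data` is just the current row.
      const leftTrimmed = _.dropWhile(results.data, (c) => c === "");
      rows.push(_.dropRightWhile(leftTrimmed, (c) => c === ""));
    },
  });
  return rows;
};

Note that the ParseResult returned by Papa.parse itself still carries an empty data array; the trimmed rows end up in the rows accumulator.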

Based on this comment from pokoli (the #2 contributor to PapaParse and the #1 contributor since early 2017), I believe this is impossible. pokoli's proposed solution is:
You should use Papa.parse to read records as array, filter them and then use Papa.Unparse to write the second file.
I wish I could mutate the data while parsing to save a pass, but PapaParse is very fast. I was able to parse a 36,000-line CSV in under 300 ms, and unparse it in twice that time. Parsing a 2,000-line CSV took under 30 ms, and unparsing again took twice the time. My use case will involve CSVs under 2,000 lines 99% of the time, so parsing into a 2-D array, filtering, unparsing back into CSV, then parsing again into JSON won't take too long.
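
A minimal sketch of that parse → filter → unparse flow, assuming each parsed row arrives as a string[] and lodash is imported as _ (parseTrimUnparse is just an illustrative name):

// Trim empty cells from both ends of a single row.
const trimRow = (row: string[]): string[] => {
  const leftTrimmed = _.dropWhile(row, (c) => c === "");
  return _.dropRightWhile(leftTrimmed, (c) => c === "");
};

// Parse into a 2-D array, trim every row, and unparse back to CSV.
export const parseTrimUnparse = (csv: string): string => {
  const { data } = Papa.parse<string[]>(csv, { skipEmptyLines: true });
  return Papa.unparse(data.map(trimRow));
};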

Related

Convert string column to json and parse in pyspark

My dataframe looks like:
| ID | Notes |
|----|-------|
| 1 | '{"Country":"USA","Count":"1000"}' |
| 2 | {"Country":"USA","Count":"1000"} |
ID : int
Notes : string
When I use from_json to parse the Notes column, it gives all null values.
I need help parsing this Notes column into separate columns in PySpark.
When you use the from_json() function, make sure the column value is exactly a JSON/dictionary in string format. In the sample data you have given, the Notes value with ID=1 is not quite JSON: it is a string, but enclosed in additional single quotes. This is the reason it returns NULL values. Applying the following code to the (corrected) input dataframe parses the column:
df = df.withColumn("Notes",from_json(df.Notes,MapType(StringType(),StringType())))
You need to change your input data so that the entire Notes column is in the same format, a JSON/dictionary as a string and nothing more, because that is the root cause of the issue. The below is the correct format that fixes it:
| ID | Notes |
|----|-------|
| 1 | {"Country":"USA","Count":"1000"} |
| 2 | {"Country":"USA","Count":"1000"} |
To parse the Notes column values into columns in PySpark, you can simply use the json_tuple() function (no need for from_json()). It extracts elements from a JSON column (string format) and returns them as new columns.
df = df.select(col("id"), json_tuple(col("Notes"), "Country", "Count")) \
       .toDF("id", "Country", "Count")
df.show()
Output:
+---+-------+-----+
| id|Country|Count|
+---+-------+-----+
|  1|    USA| 1000|
|  2|    USA| 1000|
+---+-------+-----+
NOTE: json_tuple() also returns null if the column value is not in the correct format (make sure the column values are json/dictionary as a string without additional quotes).
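
For reference, a self-contained sketch of both approaches; the SparkSession setup and the MapType schema are assumptions, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, json_tuple
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Notes must be a plain JSON string, with no extra quoting around it.
df = spark.createDataFrame(
    [(1, '{"Country":"USA","Count":"1000"}'),
     (2, '{"Country":"USA","Count":"1000"}')],
    ["id", "Notes"],
)

# Option 1: from_json with a map schema (keeps Notes as a single map column).
parsed = df.withColumn("Notes", from_json(df.Notes, MapType(StringType(), StringType())))

# Option 2: json_tuple, extracting specific keys into separate columns.
result = df.select(col("id"), json_tuple(col("Notes"), "Country", "Count")) \
           .toDF("id", "Country", "Count")
result.show()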

How do you write an array of numbers to a csv file?

let mut file = Writer::from_path(output_path)?;
file.write_record([5.34534536546, 34556.456456467567567, 345.56465456])?;
Produces the following error:
error[E0277]: the trait bound `{float}: AsRef<[u8]>` is not satisfied
--> src/main.rs:313:27
|
313 | file.write_record([5.34534536546, 34556.456456467567567, 345.56465456])?;
| ------------ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `AsRef<[u8]>` is not implemented for `{float}`
| |
| required by a bound introduced by this call
|
= help: the following implementations were found:
<&T as AsRef<U>>
<&mut T as AsRef<U>>
<Arc<T> as AsRef<T>>
<Box<T, A> as AsRef<T>>
and 44 others
note: required by a bound in `Writer::<W>::write_record`
--> /home/mlueder/.cargo/registry/src/github.com-1ecc6299db9ec823/csv-1.1.6/src/writer.rs:896:12
|
896 | T: AsRef<[u8]>,
| ^^^^^^^^^^^ required by this bound in `Writer::<W>::write_record`
Is there any way to use the csv crate with numbers instead of structs or characters?
Only strings or raw bytes can be written with write_record; if you give it anything else, it doesn't know how to handle the data (as @SilvioMayolo mentioned). You can map your float array to one of strings, and then you will be able to write the string array to the file.
let float_arr = [5.34534536546, 34556.456456467567567, 345.56465456];
let string_arr = float_arr.map(|e| e.to_string());
This can obviously be combined into one line without the extra variable, but it is a little easier to see the extra step we need to take when it is split apart.
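
Put together, a runnable sketch (the output path is a placeholder):

use csv::Writer;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // "output.csv" is a placeholder path.
    let mut file = Writer::from_path("output.csv")?;

    let float_arr = [5.34534536546, 34556.456456467567567, 345.56465456];
    // write_record needs items convertible to bytes, so stringify the floats first.
    let string_arr = float_arr.map(|e| e.to_string());
    file.write_record(&string_arr)?;

    file.flush()?;
    Ok(())
}

With the csv crate's default serde support, file.serialize((5.34534536546, 34556.456456467567567, 345.56465456))? should also write the numbers directly, skipping the manual conversion.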

Kusto KQL: reference the first object in a JSON array

I need to grab the value of the first entry in a json array with Kusto KQL in Microsoft Defender ATP.
The data format looks like this (anonymized), and I want the value of "UserName":
[{"UserName":"xyz","DomainName":"xyz","Sid":"xyz"}]
How do I split or in any other way get the "UserName" value?
In WDATP/MSTAP, for the "LoggedOnUsers" type of arrays, you want "mv-expand" (multi-value expand) in conjunction with "parsejson".
"parsejson" will turn the string into JSON, and mv-expand will expand it into LoggedOnUsers.Username, LoggedOnUsers.DomainName, and LoggedOnUsers.Sid:
DeviceInfo
| mv-expand parsejson(LoggedOnUsers)
| project DeviceName, LoggedOnUsers.UserName, LoggedOnUsers.DomainName
Keep in mind that if the packed field has multiple entries (as DeviceNetworkInfo's IPAddresses field often does), the entire row will be expanded once per entry, so a row for a machine with 3 entries in "IPAddresses" will be duplicated 3 times, once for each expansion of IPAddresses:
DeviceNetworkInfo
| where Timestamp > ago(1h)
| mv-expand parsejson(IPAddresses)
| project DeviceName, IPAddresses.IPAddress
To access the first entry's UserName property you can do the following:
print d = dynamic([{"UserName":"xyz","DomainName":"xyz","Sid":"xyz"}])
| extend result = d[0].UserName
To get the UserName for all entries, you can use mv-expand/mv-apply:
print d = dynamic([{"UserName":"xyz","DomainName":"xyz","Sid":"xyz"}])
| mv-apply d on (
project d.UserName
)
Thanks for the reply, but the proposed solution didn't work for me. Instead I found the following solution:
project substring(split(split(LoggedOnUsers,',',0),'"',4),2,9)
The output of this is: UserName

MySQL: Separate values in one column into many

I am retrieving data from a MySQL db. All the data is in one column, and I need to separate it into several columns. The structure of the column is as follows:
{{product ID=001 |Country=Netherlands |Repository Link=http://googt.com |Other Relevant Information=test }} ==Description== this are the below codes: code 1 code2 ==Case Study== case study 1 txt case study 2 txt ==Benefits== ben 1 ben 2 === Requirements === (empty col) === Architecture === *arch1 *arch2
So I want columns like Product ID, Country, Repository Link, Architecture, etc.
If you are planning on simply parsing out the output of your column, it will depend on the language of choice you are currently using.
However, in general the procedure for doing this is as follows.
1. Pull the output into a string.
2. Find a delimiter (in your case it appears '|' will do).
3. You have two options here (again depending on language):
   A. Split each segment into an array, then run the array through a looping structure to print out each section, or use the array to manipulate the data individually (your choice).
   B. With the simple string method, either create a new string or replace all instances of '|' with '\n' (newline char) so that you can display all the data.
I recommend the array conversion as this will allow you to easily interact with the data in a simple manner.
This is often something done today with json and other such formats which are often stored in single fields for various reasons.
Here is an example in PHP making use of explode():
$unparsed = "this | is | a | string that is | not: parsed";
$parsed = explode("|", $unparsed);
echo $parsed[2]; // would be a
echo $parsed[4]; // would be not: parsed

Use JSON Input step to process uneven data

I'm trying to process the following with a JSON Input step:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
However this seems not to be possible:
Json Input.0 - ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) :
The data structure is not the same inside the resource!
We found 1 values for json path [$..Locality], which is different that the number retourned for path [$..Street] (3509 values).
We MUST have the same number of values for all paths.
The step provides an Ignore Missing Path flag, but it only works if all the rows miss the same path. In that case the step acts as expected and fills the missing values with null.
This limits the power of this step to read uneven data, which was really one of my priorities.
My step Fields are defined as follows (screenshot omitted):
Am I missing something? Is this the correct behavior?
What I have done is use a JSON Input step with $.address[*] to read the full map of each element into a jsonRow field, e.g.:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
This results in 4 jsonRows, one for each element, e.g. jsonRow = {"AddressId":"1_101","Street":"Another Street"}. Then, using a JavaScript step, I map my values like this:
var AddressId = getFromMap('AddressId', jsonRow);
var Street = getFromMap('Street', jsonRow);
var Locality = getFromMap('Locality', jsonRow);
In a second script tab I inserted minified JSON parse code from https://github.com/douglascrockford/JSON-js and the getFromMap function:
function getFromMap(key, jsonRow) {
  try {
    var map = JSON.parse(jsonRow);
  } catch (e) {
    var message = "Unparsable JSON: " + jsonRow + " Desc: " + e.message;
    var nr_errors = 1;
    var field = "jsonRow";
    var errcode = "JSON_PARSE";
    _step_.putError(getInputRowMeta(), row, nr_errors, message, field, errcode);
    trans_Status = SKIP_TRANSFORMATION;
    return null;
  }
  if (map[key] == undefined) {
    return null;
  }
  trans_Status = CONTINUE_TRANSFORMATION;
  return map[key];
}
You can solve this by changing the JSONPath and splitting the work across two JSON Input steps. The following website explains a lot about JSONPath: http://goessner.net/articles/JsonPath/
$..AddressId
does in fact return all the AddressIds in the address array, BUT since Pentaho uses grid rows for input and output [4 rows x 3 columns], it can't handle a missing (null) value when you ask it to return all the Streets (3 rows) and all the Localities (2 rows), simply because there are no null values in the array itself; as in, you can't drive out of your garage with 3 wheels on your car instead of the usual 4.
I guess your script returns null values (marked X below) like:
A S X
A S X
A S L
A X L
The scripting step can be avoided by changing the Fields path of the first JSON Input step to:
$.address[*]
This retrieves all 4 address lines. Then create a second JSON Input step, based on the new source field containing the address line(s), to retrieve the address details per line:
$.AddressId
$.Street
$.Locality
This yields null values on the four address lines whenever an address detail is not available in a line.
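With the sample address array above, the second step would then produce:
| AddressId | Street | Locality |
|-----------|--------|----------|
| 1_1 | A Street | null |
| 1_101 | Another Street | null |
| 1_102 | One more street | Buenos Aires |
| 1_102 | null | New York |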