I'm trying to use to_html(formatters={column: function}) in order to make visual changes to the output.
import pandas as pd

df = pd.DataFrame({'name': ['Here', 4.45454, 5]}, index=['A', 'B', 'C'])

def _form(val):
    value = 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)
    return value

df.to_html(formatters={'name': _form})
I get

|   | name    |
| - | ------- |
| A | STRING  |
| B | 4.45454 |
| C | 5       |
instead of
|   | name   |
| - | ------ |
| A | STRING |
| B | 4      |
| C | 5      |
The problem here is that the float value isn't formatted.
On the other hand, when all the values are floats or integers, it gives the desired result:
df = pd.DataFrame({'name': [323.322, 4.45454, 5]}, index=['A', 'B', 'C'])

def _form(val):
    value = 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)
    return value

df.to_html(formatters={'name': _form})
How can it be fixed?
Thank you.
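A possible workaround (not part of the original question) is to apply the formatter yourself before rendering, so that to_html only ever sees already-formatted strings; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'name': ['Here', 4.45454, 5]}, index=['A', 'B', 'C'])

def _form(val):
    return 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)

# Pre-format the mixed-type column on a copy, then render the copy.
formatted = df.copy()
formatted['name'] = formatted['name'].map(_form)
print(formatted.to_html())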
I'm trying to write a query in PySpark that will get the correct value from an array.
For example, I have a dataframe called df with three columns: 'companyId', 'companySize' and 'weightingRange'. The 'companySize' column is just the number of employees. The 'weightingRange' column is an array containing the following:
[ {"minimum":0, "maximum":100, "weight":123},
{"minimum":101, "maximum":200, "weight":456},
{"minimum":201, "maximum":500, "weight":789}
]
so the dataframe looks like this (weightingRange is as above; it's truncated in the example below for clearer formatting):
+-----------+-------------+------------------------+
| companyId | companySize | weightingRange         |
+-----------+-------------+------------------------+
| ABC1      | 150         | [{"maximum":100, etc}] |
| ABC2      | 50          | [{"maximum":100, etc}] |
+-----------+-------------+------------------------+
So for an entry with companySize = 150, I need to return the weight 456 in a new column called 'companyWeighting'.
It should show the following:
+-----------+-------------+------------------------+------------------+
| companyId | companySize | weightingRange | companyWeighting |
+-----------+-------------+------------------------+------------------+
| ABC1 | 150 | [{"maximum":100, etc}] | 456 |
| ABC2 | 50 | [{"maximum":100, etc}] | 123 |
+-----------+-------------+------------------------+------------------+
I've had a look at

df.withColumn("tmp", explode(col("weightingRange"))).select("tmp.*")

and then joining, but applying that approach would produce a Cartesian product of the data.
Suggestions appreciated!
You can approach it like this.
First, create a sample dataframe:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ('ABC1', 150, [{"min": 0,   "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}]),
    ('ABC2', 50,  [{"min": 0,   "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}])],
    ['companyId', 'companySize', 'weightingRange'])
Then, create a UDF and apply it to each row to get the new column:
def get_weight(wt, wt_rnge):
    for _d in wt_rnge:
        if _d['min'] <= wt <= _d['max']:
            return _d['weight']

get_weight_udf = F.udf(lambda x, y: get_weight(x, y))
df = df.withColumn('companyWeighting', get_weight_udf(F.col('companySize'), F.col('weightingRange')))
df.show()
You get the output:
+---------+-----------+--------------------+----------------+
|companyId|companySize| weightingRange|companyWeighting|
+---------+-----------+--------------------+----------------+
| ABC1| 150|[Map(weight -> 12...| 456|
| ABC2| 50|[Map(weight -> 12...| 123|
+---------+-----------+--------------------+----------------+
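Not part of the original answer, but on Spark 2.4 or later the same lookup can be sketched without a UDF by using the filter higher-order function on the sample dataframe built above (where each element of weightingRange is a map with 'min', 'max' and 'weight' keys):

import pyspark.sql.functions as F

# Assumes Spark >= 2.4 and the sample dataframe defined above.
df = df.withColumn(
    'companyWeighting',
    F.expr("filter(weightingRange, x -> x['min'] <= companySize AND companySize <= x['max'])[0]['weight']")
)
df.show()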
If I want to get the content value as a string from a JSON_OBJECT.item ("key") without having to write something like:
some_json_value_as_string: STRING
    do
        if attached {JSON_STRING} l_json_o as l_s then
            Result := l_s.unescaped_string_8
        elseif attached {JSON_NUMBER} l_json_o as l_n then
            Result := l_n.item.out
        else
            check
                you_forgot_to_treat_a_case: False
            end
        end
    end
for a JSON object like:
{
    "datasource_name": "DODBC",
    "datasource_username": "dev_db_usr",
    "datasource_password": "somePassword",
    "ewf_listening_port": 9997,
    "log_file_path": "/var/log/ewf_app.log",
    "default_selected_company": 1,
    "default_selected_branch": 1,
    "default_selected_consumption_sector": 1,
    "default_selected_measuring_point": 1,
    "default_selected_charge_unit": -1
}
the {JSON_VALUE}.representation printed with io.putstring is:

datasource_username=dev_db_usr

and not only the value!
Is there a way to do that? I didn't find the different methods of JSON_VALUE intuitive: the out method, for example, gives the class name and pointer address, which is really far from a string representation of the associated JSON object for me...
The feature {JSON_VALUE}.representation is the string representation of the Current JSON value.
OK, but if you have jo: JSON_OBJECT and, say, datasource_username_key: STRING = "datasource_username", then you can do:
if attached jo.item (datasource_username_key) as l_value then
    print (l_value.representation)
end
I am attempting to convert nested JSON data to a flat table:
(I have edited this as I thought I had a working solution and was asking for advice on optimisation; it turns out I don't have it working...)
import pandas as pd
import json
from collections import OrderedDict

# https://stackoverflow.com/questions/36720940/parsing-nested-json-into-dataframe
def flatten_json(json_object, container=None, name=''):
    if container is None:
        container = OrderedDict()
    if isinstance(json_object, dict):
        for key in json_object:
            flatten_json(json_object[key], container=container, name=name + key + '_')
    elif isinstance(json_object, list):
        for n, item in enumerate(json_object, 1):
            flatten_json(item, container=container, name=name + str(n) + '_')
    else:
        container[str(name[:-1])] = str(json_object)
    return container
data = '{"page":1,"pages":2,"totaItems":22,"data":[{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":false},{"id":4,"value":false},{"id":5,"value":"2056-04-05T00:00:00Z"}]},{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":false},{"id":4,"value":false}]}]}'
json_data = json.loads(data)

dataframes = list()
for record in json_data['data']:
    out = pd.DataFrame(flatten_json(record), index=[0])
    dataframes.append(out)

frame = pd.concat(dataframes)
print(frame)
However, I can't help but feel this might be overly complicated for what I am trying to achieve. This script is the result of a few hours of research and it's the best I can come up with. Does anyone have any pointers/advice to refine it?
I'm essentially completely flattening the JSON data (under the data record) into a dataframe to later be exported to CSV.
Ideal output:
+-------+-----+----------+----------------+----------------+----------------------+-------+-------+----------------------------+-----+------+--------------+-------+
| bId | cId | cName | customFields_3 | customFields_4 | customFields_5 | eId | fname | minutesExcludingViolations | sId | snme | totalMinutes | vType |
+-------+-----+----------+----------------+----------------+----------------------+-------+-------+----------------------------+-----+------+--------------+-------+
| 29802 | 21 | Regional | FALSE | FALSE | 2056-04-05T00:00:00Z | 38344 | Adon | 590 | 15 | CD | 590 | None |
| 29802 | 21 | Regional | FALSE | FALSE | null | 38344 | Adon | 590 | 15 | CD | 590 | None |
+-------+-----+----------+----------------+----------------+----------------------+-------+-------+----------------------------+-----+------+--------------+-------+
EDIT: It turns out this solution doesn't work, I just hadn't noticed. I've added my idealised output and shortened the input data slightly to make it easier to work with for now.
EDIT2: Possible solution... Gives the right output.
main_frame = pd.DataFrame(json_data['data'])
del main_frame['customFields']

frames = list()
for record in json_data['data']:
    out = pd.DataFrame.from_records(record['customFields']).T
    out = out.reset_index(drop=True)
    out.columns = out.iloc[0]
    out = out.reindex(out.index.drop(0))
    frames.append(out)

custom_fields_frame = pd.concat(frames).reset_index(drop=True)
main_frame = main_frame.join(custom_fields_frame)
print(main_frame)
Thanks,
This solution should do the job efficiently! It converts the nested JSON to a dataframe.
nested_json=[{"page":1,"pages":2,"totaItems":22,"data":[{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":"false"},{"id":4,"value":"false"},{"id":5,"value":"true"},{"id":6,"value":"false"},{"id":7,"value":"false"},{"id":14,"value":"2056-04-05T00:00:00Z"},{"id":15,"value":"Tester"}]},{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":"false"},{"id":5,"value":"true"},{"id":6,"value":"false"},{"id":7,"value":"false"},{"id":14,"value":"2056-04-05T00:00:00Z"},{"id":15,"value":"Tester"},{"id":16,"value":"false"},{"id":17,"value":"false"}]}]}]
from pandas.io.json import json_normalize

json_df = json_normalize(nested_json)
json_columns = list(json_df.columns.values)

# just pick the column_name instead of something.something.column_name
for w in range(len(json_columns)):
    json_columns[w] = json_columns[w].split('.')[-1].lower()

json_df.columns = json_columns
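Not part of the original answer, but to get closer to the ideal wide output shown in the question, one possible sketch flattens the records under 'data' and then pivots each customFields list into one customFields_<id> column (assuming the nested_json sample defined above):

import pandas as pd
from pandas.io.json import json_normalize

# Flatten the records under 'data'; customFields stays as a list-valued column.
records = json_normalize(nested_json, record_path='data')

# Pivot each customFields list into one customFields_<id> column per id.
custom = records['customFields'].apply(
    lambda fields: {'customFields_{}'.format(f['id']): f['value'] for f in fields}
).apply(pd.Series)

flat = records.drop(columns='customFields').join(custom)
print(flat)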
Assuming that:
A1 = 3
B1 = customFunc(A1) // will be 3
In my custom function:
function customFunc(v) {
  return v;
}
v will be 3. But I want to access the cell object A1.
The following is transcribed from the comment below.
Input:
+---+---+
| | A |
+---+---+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+---+---+
I want to copy A1:A4 to B1:C2 using a custom function.
Desired result:
+---+---+---+---+
| | A | B | C |
+---+---+---+---+
| 1 | 1 | 1 | 2 |
| 2 | 2 | 3 | 4 |
| 3 | 3 | | |
| 4 | 4 | | |
+---+---+---+---+
To achieve the desired result of splitting an input list into multiple rows, you can try the following approach.
function customFunc(value) {
  if (!Array.isArray(value)) {
    return value;
  }
  // Filter input that is more than a single column or single row.
  if (value.length > 1 && value[0].length > 1) {
    throw "Must provide a single value, column or row as input";
  }
  var result;
  if (value.length == 1) {
    // Extract single row from 2D array.
    result = value[0];
  } else {
    // Extract single column from 2D array.
    result = value.map(function (x) {
      return x[0];
    });
  }
  // Return the extracted list split in half between two rows.
  return [
    result.slice(0, Math.round(result.length/2)),
    result.slice(Math.round(result.length/2))
  ];
}
Note that it doesn't require working with cell references. It purely deals with manipulating the input 2D array and returning a transformed 2D array.
Using the function produces the following results:
A1:A4 is hardcoded, B1 contains =customFunc(A1:A4)
+---+---+---+---+
| | A | B | C |
+---+---+---+---+
| 1 | a | a | b |
| 2 | b | c | d |
| 3 | c | | |
| 4 | d | | |
+---+---+---+---+
A1:D1 is hardcoded, A2 contains =customFunc(A1:D1)
+---+---+---+---+---+
| | A | B | C | D |
+---+---+---+---+---+
| 1 | a | b | c | d |
| 2 | a | b | | |
| 3 | c | d | | |
+---+---+---+---+---+
A1:B2 is hardcoded, A3 contains =customFunc(A1:B2), the error message is "Must provide a single value, column or row as input"
+---+---+---+---------+
| | A | B | C |
+---+---+---+---------+
| 1 | a | c | #ERROR! |
| 2 | b | d | |
+---+---+---+---------+
This approach can be built upon to perform more complicated transformations by processing more arguments (i.e. number of rows to split into, number of items per row, split into rows instead of columns, etc.) or perhaps analyzing the values themselves.
A quick example of performing arbitrary transformations by creating a function that takes a function as an argument.
This approach has the following limitations though:
you can't specify a function in a cell formula, so you'd need to create wrapper functions to call from cell formulas
this performs a uniform transformation across all of the cell values
The function:
/**
 * @param {Object|Object[][]} value The cell value(s).
 * @param {function=} opt_transform An optional function used to transform the values.
 * @returns {Object|Object[][]} The transformed values.
 */
function customFunc(value, opt_transform) {
  var transform = opt_transform || function(x) { return x; };
  if (!Array.isArray(value)) {
    return transform(value);
  }
  // Filter input that is more than a single column or single row.
  if (value.length > 1 && value[0].length > 1) {
    throw "Must provide a single value, column or row as input";
  }
  var result;
  if (value.length == 1) {
    // Extract single row from 2D array.
    result = value[0].map(transform);
  } else {
    // Extract single column from 2D array.
    result = value.map(function (x) {
      return transform(x[0]);
    });
  }
  // Return the extracted list split in half between two rows.
  return [
    result.slice(0, Math.round(result.length/2)),
    result.slice(Math.round(result.length/2))
  ];
}
And a quick test:
function test_customFunc() {
  // Single cell.
  Logger.log(customFunc(2, function(x) { return x * 2; }));
  // Row of values.
  Logger.log(customFunc([[1, 2, 3, 4]], function(x) { return x * 2; }));
  // Column of values.
  Logger.log(customFunc([[1], [2], [3], [4]], function(x) { return x * 2; }));
}
Which logs the following output:
[18-06-25 10:46:50:160 PDT] 4.0
[18-06-25 10:46:50:161 PDT] [[2.0, 4.0], [6.0, 8.0]]
[18-06-25 10:46:50:161 PDT] [[2.0, 4.0], [6.0, 8.0]]
I have the following sample DataFrame:
a    | b    | c    |
1    | 2    | 4    |
0    | null | null |
null | 3    | 4    |
And I want to replace the null values only in the first two columns, "a" and "b":
a | b | c    |
1 | 2 | 4    |
0 | 0 | null |
0 | 3 | 4    |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
df.fillna(0, subset=['a', 'b'])
There is a parameter named subset to choose the columns, unless your Spark version is lower than 1.3.1.
Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )
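A minimal sketch (not from the original answers) showing both approaches on the sample dataframe from the question; either way, column "c" keeps its nulls:

# Restrict fillna to the chosen columns with subset (Spark 1.3.1+)...
df2.fillna(0, subset=['a', 'b']).show()

# ...or pass a dict keyed by column name.
df2.fillna({'a': 0, 'b': 0}).show()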