How to create a function to create bins for all columns in a pandas dataframe - binning

I have a pandas data frame that contains 30 columns, named age, salary, investments, loan, etc. I have converted all numeric values to standardised values using sklearn's StandardScaler, so all 30 columns contain standardised values. Now I need to create three bins named "low", "medium" and "high". I have tried to create the bins manually, writing code for each column. The code I have used is:
bins = [-3,-1.5,1,3]
names=["low","med","high"]
df['age'] = pd.cut(df['age'], bins, labels=names)
It works, but I would need to write this code for all 30 columns. I am not sure how to write dynamic code that creates bins for all 30 columns.

The following code will apply pd.cut to all columns:
bins = [-3,-1.5,1,3]
names=["low","med","high"]
df = pd.DataFrame([[-1,0,-2,-1,-2,1,0], [-1,0,-2,-2,-2,-1,1]])
df.apply(pd.cut, bins=bins, labels=names)
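Applied to the original dataframe, that could look like the sketch below. It assumes all 30 scaled columns should share the same bin edges; the listed column names are just the ones mentioned in the question.
import pandas as pd

bins = [-3, -1.5, 1, 3]
names = ["low", "med", "high"]

# Bin every column of the dataframe in one go
df_binned = df.apply(pd.cut, bins=bins, labels=names)

# Or bin only a chosen subset of columns, in place
cols = ["age", "salary", "investments", "loan"]  # extend to all 30 column names
df[cols] = df[cols].apply(pd.cut, bins=bins, labels=names)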

Related

Creating new column based on condition from another table

As my previous question was quite ambiguous and poorly written, I will be more explicit in this thread. This is the previous post: How can I create a new column to SQL while adding conditions?
Basically, the two tables are these:
vw_RecipeIngredientCheck (https://i.stack.imgur.com/jlArc.png)
SELECT TOP (1000) [RecipeName]
,[RecipeIngredientName]
,[Unit]
,[Amount]
,[DisplayOrder]
,[IngredientGroup]
,[VirtualProductName]
FROM [dbo].[vw_RecipeIngredientCheck]
VirtualProduct (https://i.stack.imgur.com/xU8S7.png)
SELECT TOP (1000) [Id]
,[Name]
,[NativeUnitID]
,[Mapping]
,[Kg]
,[g]
,[l]
,[dl]
,[unit]
,[empty]
,[pack]
,[teaspoon]
,[spoon]
,[can]
,[bundle]
,[clove]
,[smidgen]
,[cube]
,[stick]
,[slice]
,[IsDurable]
,[letter]
,[Glass]
,[ProductImageId]
,[ResolvesToRealProduct]
FROM [dbo].[VirtualProduct]
My goal is to create a new column in the vw_RecipeIngredientCheck table which converts the Unit to a standard unit (which is given in the VirtualProduct table and is called NativeUnitID).
Note that the units in the VirtualProduct table already have the conversion logic implemented.
So the point is to create a new column in vw_RecipeIngredientCheck, join on [VirtualProductName], and then multiply [Amount] by the column in VirtualProduct that is named after its [Unit].
Example:
[NewColumn] = [Amount] * (column name in VirtualProduct = [Unit])
Essentially, after joining the two tables I got this:
Joined tables
I don't know how to write the SQL Query so that the Amount gets multiplied by the column that matches its Unit. For example, in the image above, the row with Index 33 has the following:
NativeUnitID: 3 (which is Kg)
Unit: g
Amount: 250
In the column IngredientStd I would like to have the current amount in grams converted into its NativeUnitID, which is Kg. The amount should be multiplied by the column "g", i.e. the column whose name matches the content of the Unit column. What is troubling me is comparing the values inside [Unit] with the names of the columns.
You will have to try a dynamic query to achieve this.
Reference: https://www.c-sharpcorner.com/blogs/creating-a-dynamic-table-in-sql-server1
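For illustration only, a dynamic SQL sketch along those lines might look like the following. It is untested, assumes SQL Server 2017+ for STRING_AGG, and the list of non-unit columns to exclude is only a guess based on the SELECT above, so adjust it to the real schema.
-- Build a CASE expression over the unit columns of VirtualProduct at run time,
-- then multiply Amount by whichever column matches the row's Unit.
DECLARE @case nvarchar(max);

SELECT @case = STRING_AGG(
           N'WHEN ' + QUOTENAME(c.name, '''') + N' THEN v.' + QUOTENAME(c.name), N' ')
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.VirtualProduct')
  AND c.name NOT IN (N'Id', N'Name', N'NativeUnitID', N'Mapping',
                     N'IsDurable', N'ProductImageId', N'ResolvesToRealProduct');

DECLARE @sql nvarchar(max) =
    N'SELECT r.*, r.Amount * CASE r.Unit ' + @case + N' END AS IngredientStd
      FROM dbo.vw_RecipeIngredientCheck AS r
      JOIN dbo.VirtualProduct AS v ON v.Name = r.VirtualProductName;';

EXEC sp_executesql @sql;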
I would change the table listing the various weight units to a table with two data columns: the name of the unit and a weight coefficient relative to the most common unit (say 1 gram). Then IMHO you could easily use two joins by unit name, first converting to the standard unit, then to your native unit. IMHO there is not much value in maintaining DB logic with names of units as names of columns. Maybe you really need it (or it is very convenient to use) for something else, in which case my advice is not applicable.
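A rough sketch of that redesign is shown below. The UnitConversion table, its column names, and the assumption that NativeUnitID points at such a table are all illustrative, not part of the original schema.
-- Illustrative lookup table: one row per unit, with a coefficient relative to 1 gram.
CREATE TABLE dbo.UnitConversion (
    UnitID   int            PRIMARY KEY,     -- assumed to match VirtualProduct.NativeUnitID
    UnitName nvarchar(20)   NOT NULL UNIQUE, -- 'g', 'Kg', 'l', ...
    ToGrams  decimal(18, 6) NOT NULL         -- e.g. 1 for g, 1000 for Kg
);

-- Convert Amount into grams, then into the recipe's native unit, using two joins.
SELECT r.RecipeName,
       r.RecipeIngredientName,
       r.Unit,
       r.Amount,
       r.Amount * src.ToGrams / dst.ToGrams AS IngredientStd
FROM dbo.vw_RecipeIngredientCheck AS r
JOIN dbo.VirtualProduct AS v   ON v.Name       = r.VirtualProductName
JOIN dbo.UnitConversion AS src ON src.UnitName = r.Unit
JOIN dbo.UnitConversion AS dst ON dst.UnitID   = v.NativeUnitID;
With the example row above, 250 g with a native unit of Kg would give 250 * 1 / 1000 = 0.25.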

JSON in R: fetching data from JSON

I have a dataframe of more than 10000 customers with 4 columns: customer_id, x, y, z.
Here the x, y, z columns have data stored in JSON format and I want to fetch that data. Consider that the customers took a survey and answered different questions; some customers have less data inside these variables and some have more, but the variable names inside are the same. I want an output dataframe which contains customer_id and all the information available inside x, y, z.
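One way to sketch this, assuming each cell of x, y and z holds a single flat JSON object and that the jsonlite and dplyr packages are available (the data frame df and the prefixes are illustrative):
# A minimal sketch, assuming each of x, y, z holds one flat JSON object per row.
library(jsonlite)
library(dplyr)

parse_json_col <- function(json_vec, prefix) {
  parsed <- lapply(json_vec, function(s) as.data.frame(fromJSON(s), stringsAsFactors = FALSE))
  out <- bind_rows(parsed)                            # missing answers become NA
  names(out) <- paste(prefix, names(out), sep = "_")  # avoid name clashes across x, y, z
  out
}

result <- bind_cols(
  df["customer_id"],
  parse_json_col(df$x, "x"),
  parse_json_col(df$y, "y"),
  parse_json_col(df$z, "z")
)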

GNUPlot - Arbitrary number of columns in stacked line

I am working on a script that generates a csv with arbitrary number of y values for a given x value. The first row of the csv has the names of the data sets. The x value is a unix timestamp. I would like to use gnuplot to graph this data as a stacked line graph where the values are shown as fractions of a total for that row. How would I do this?
I've looked at the following solutions, and attempted to integrate them, but I cannot figure out what I am doing wrong.
It either says there are not enough columns or reports some mismatch in the number of columns.
There are up to N columns of data for a given time index, and the total I am looking at is the total across those columns for that time index.
--
An example of my data:
KEY,CSS,JavaScript,Perl,Python,Shell
1428852630,0,0,0,0,406
1428852721,0,0,0,0,406
1428852793,0,0,0,0,406
1428853776,0,0,0,0,781
1429889154,0,0,0,0,1200
1429891056,0,0,0,0,1648
1429891182,0,0,0,0,1648
1429891642,0,0,0,0,1648
1430176065,0,0,0,0,2056
However, there might be a large number of columns, so I want a script that determines the number of columns at run time.
http://gnuplot.sourceforge.net/demo/histograms.html - This seems to have issues with being modified to have an arbitrary number of columns.
plot 'immigration.dat' using (100.*$2/$24):xtic(1) t column(2), \
for [i=3:23] '' using (100.*column(i)/column(24)) title column(i)
https://newspaint.wordpress.com/2013/09/11/creating-a-filled-stack-graph-in-gnuplot/
This answer shows how to count columns, with a slight modification:
file="file.dat"
get_number_of_cols = "awk -F, 'NR == 1 { print NF; exit }' ".file
nc=system(get_number_of_cols)
Then you need to sum columns 2 to nc; let's do it with a recursion:
sum(c,C)=((c!=C)?column(c)+sum(c+1,C):column(c))
And now you can plot:
set key outside
set datafile separator ","
plot for [i=nc:2:-1] file using 0:(100*sum(2,i)/sum(2,nc)):xtic(1) title columnhead(i) with filledcurve y=0
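Putting the pieces together, a complete minimal script might look like this (the file name is illustrative; it assumes the first CSV row holds the column headers as in the sample above):
# Count the columns from the header row, then plot stacked percentage curves.
file = "data.csv"
nc = system("awk -F, 'NR == 1 { print NF; exit }' ".file)
sum(c,C) = ((c != C) ? column(c) + sum(c+1,C) : column(c))  # sum of columns c..C for the current row
set datafile separator ","
set key outside
plot for [i=nc:2:-1] file using 0:(100*sum(2,i)/sum(2,nc)):xtic(1) \
     title columnhead(i) with filledcurve y=0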

MS Access - split one text field dynamically into columns

I have an Excel file with 900+ columns that I need to import into Access on a regular basis. Unfortunately I get the Excel file as such and can't change the data structure. The good news is I only need a few of those 900+ columns. Unfortunately MS Access can't work with files of more than 255 columns.
So the idea is to import it as a csv file with all columns of each row in just one text field, and then use VBA in Access with Split to break it apart again.
Question:
As I don't need all columns, I want to keep only some items. So as input I have a list of column numbers I need to keep. The list is dynamic in the sense that it is user defined; there is a table with all the item numbers users want to have.
I can relatively easily split the sourceTbl field.
SELECT split(field1, vbTab) from sourceTbl
If I knew I always needed to extract certain columns, I could probably write something like
SELECT getItem(field1, vbTab, 1), getItem(field1, vbTab, 4), ...
where getItem would be a custom function that returns item number i (a sketch of such a function is shown after the sample data below). The problem is that which columns, and how many, to retrieve is not static; I read that dynamically from another table that lists the item numbers to keep.
Sample Data:
sourceTbl: field1 = abc;def;rtz;jkl;wertz;hjk
columnsToKeep: 1,4,5
Should output: abc, jkl, wertz
The Excel files have around 20k rows each, about 100 MB of data per file, and there are about 5 files per import. Filtered down to the needed columns, the imported data is about 50 MB.
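A helper along the lines of the getItem mentioned above could look roughly like this. It is only a sketch: the Null handling, the 1-based indexing and the empty-string fallback are assumptions.
' Returns the 1-based item at position itemNo from a delimited string,
' or an empty string if the position is out of range.
Public Function getItem(ByVal sourceText As Variant, ByVal delimiter As String, _
                        ByVal itemNo As Long) As Variant
    Dim parts() As String

    If IsNull(sourceText) Then
        getItem = Null
        Exit Function
    End If

    parts = Split(CStr(sourceText), delimiter)
    If itemNo >= 1 And itemNo <= UBound(parts) + 1 Then
        getItem = parts(itemNo - 1)
    Else
        getItem = ""
    End If
End Function
The dynamic part could then be handled in VBA by looping over the columnsToKeep table and building the list of getItem(...) calls as a string before running the query.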

Issue with SSIS on flat files to tables with fixed position

I have a couple of questions about the task on which I am stuck and any answer would be greatly appreciated.
I have to extract data from a flat file (CSV) as an input and load the data into the destination table with a specific format based on position.
For example, if I have order_id,Total_sales,Date_Ordered with some data in it, I have to extract the data and load it in a table like so:
The first field has a fixed length of 2 with numeric as a datatype.
total_sales is inserted into the column of total_sales in the table with a numeric datatype and length 10.
date as datetime in a format which would be different than that of the flat file, like ccyy-mm-dd.hh.mm.ss.xxxxxxxx (here x has to be filled up with zeros).
Maybe I don't have the right idea to solve this - any solution would be appreciated.
I have tried using the following ways:
I used a Flat File Source to read the CSV file and then gave it as input to an OLE DB Destination, with a table of fixed data types already created. The problem is that the columns are loaded, but I have to pad them with zeros: the date needs padding when it is loaded, and for most of the other columns, if a value does not use the full length it has to be preceded with zeros.
For example, if I have an Orderid of length 4 and in the flat file I have an order id like 201 then it has to be changed to 0201 when it is loaded in the table.
I also tried another way: using a Flat File Source, I created a variable which takes the entire row as input and tried to separate it with Derived Columns. I was successful to an extent, but in the end the data type of the derived column was fixed to Boolean, which I am not able to change to the data type I want.
Please give me some suggestions on how to handle this issue...
Assuming you have a csv file in the following format
order_id,Total_sales,Date_Ordered
1,123.23,01/01/2010
2,242.20,02/01/2010
3,34.23,3/01/2010
4,9032.23,19/01/2010
I would start by creating a Flat File Source (inside a Data Flow Task), but rather than having it fixed width, set the format to Delimited. Tick the Column names in the first data row. On the column tab, make sure row delimiter is set to "{CR}{LF}" and column delimiter is set to "Comma(,)". Finally, on the Advanced tab, set the data types of each column to integer, decimal and date.
You mention that you want to pad the numeric data types with leading zeros when storing them in the database. Numeric data types in databases tend not to hold leading zeros. So you have two options: either hold the data as the types they are in the target system (int, decimal and dateTime) or use the Derived Column control to convert them to strings. If you decide to store them as strings, adding an expression like
"00000" + (DT_WSTR, 5) [order_id]
to the Derived Column control will add up to 5 leading zeros to order id (don't forget to set the data type length to 5) and would result in an order id of "00001"
Create your target within a Data Flow Destination and make the table/field mappings accordingly (or let SSIS create a new table / mappings for you).