Multiplying columns in a df using Python

I have a df as follows
USD Weight %
0 850,910,498,731 2.325581
1 850,910,498,731 9.325581
2 850,910,498,731 2.325532
3 850,910,498,731 4.325343
4 850,910,498,731 8.325581
5 850,910,498,731 2.325581
6 850,910,498,731 9.325581
I want to create another column, called dollars, which multiplies the USD and Weight % columns together. I am using the code below to do it, but I keep getting an error.
df['$'] = df['USD'] * df['Weight %']
The error I keep getting is:
ufunc 'multiply' did not contain a loop with signature matching types dtype('
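A likely cause of this error is that the USD values were parsed as comma-formatted strings (an object column), so pandas cannot multiply them by the float Weight % column. A minimal sketch of one way to clean the column first, assuming the column names shown above:
import pandas as pd
# df is assumed to be the dataframe shown above
# strip the thousands separators and convert USD to a numeric dtype
df['USD'] = pd.to_numeric(df['USD'].astype(str).str.replace(',', ''))
# the element-wise multiplication should then work
df['$'] = df['USD'] * df['Weight %']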

Related

Average Value Calculation

How can I calculate the average value of a particular column in a text file with the help of a Tcl script?
For example I have a text file containing 3 columns like:
1 2 3
4 5 6
5 9 7
3 2 8
And I want to do the average value calculation for column 1 only; how can I do it using a Tcl script?
Create an empty list to store the values
Split each line by spaces to get the first-column value and append it to the list
Divide the sum of the list by its length
someFile:
1 2 3
4 5 6
5 9 7
3 2 8
Hence:
fileName = "someFile"  # the file shown above
values = []  # an empty list to store the first-column values
with open(fileName, 'r') as f:
    content = f.readlines()
content = [l.strip() for l in content if l.strip()]  # remove empty lines
for line in content:
    values.append(int(line.split(" ")[0]))  # convert str to int and append
print(sum(values) / float(len(values)))
OUTPUT:
3.25

Filtering duplicate rows based on multiple columns (QueryBuilder 4.2)

I've run into a little difficulty when trying to filter the top N results for a table.
Assume the following table:
ID, X, Y, Result0, Result1
-------------------------------
0 0 0 1 4
1 0 1 2 5
2 0 1 1 4
3 0 2 2 5
4 0 3 0 1
5 1 3 3 4
6 1 3 2 5
7 1 3 4 6
So, let's say I want to get the top 2 results with the highest Result0 value, using Result1 as a tie breaker if the Result0 values are equal, and having only distinct values for (X, Y). If I run the following query:
$result = DB::table('table')
    ->orderBy('Result0', 'desc')
    ->orderBy('Result1', 'desc')
    ->take(2)
    ->get();
This code will return IDs 7 and 5, because they have the highest Result0 values, but the X, Y values for these rows are identical, and I'd like to get only the top result for each distinct X, Y pair.
I tried adding a
->groupBy('X','Y')
But it grouped the entries based on the database order of the entries (i.e. the ID) rather than my sorting of the table.
Does anyone have an idea how I can achieve my goal?
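For comparison with the pandas questions elsewhere on this page, the underlying logic (sort by Result0 then Result1 descending, keep one row per distinct (X, Y), then take the top 2) can be sketched as follows; this is only an illustration on the example table, not query builder syntax:
import pandas as pd

df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5, 6, 7],
    'X': [0, 0, 0, 0, 0, 1, 1, 1],
    'Y': [0, 1, 1, 2, 3, 3, 3, 3],
    'Result0': [1, 2, 1, 2, 0, 3, 2, 4],
    'Result1': [4, 5, 4, 5, 1, 4, 5, 6],
})
# best rows first, one row per (X, Y), then the overall top 2
top = (df.sort_values(['Result0', 'Result1'], ascending=False)
         .drop_duplicates(['X', 'Y'])
         .head(2))
print(top)  # IDs 7 and 1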

I would like to know how to pull out the duplicate information

So I am new to pandas/Python. Currently, I am tasked with identifying which IDs in the "id" column are duplicates. For example, if ID 413 appears more than one time, it is considered a duplicate. Since there are more than 600,000 IDs, I need to know the code to do it. Please help!
You can use duplicated which will return a boolean series to mask the df and then call unique to return an array of the duplicated IDs:
In [196]:
df = pd.DataFrame({'ID':[0,1,1,3,4,5,6,6,6,]})
df
Out[196]:
ID
0 0
1 1
2 1
3 3
4 4
5 5
6 6
7 6
8 6
In [201]:
df[df['ID'].duplicated()]['ID'].unique()
Out[201]:
array([1, 6], dtype=int64)
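If you also want to see every row carrying a duplicated ID, including its first occurrence, a variant is to pass keep=False so that all occurrences are flagged:
# keep=False flags every occurrence of a repeated ID, not only the later ones
df[df['ID'].duplicated(keep=False)]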

Pandas: flattening repeating/wrapped columns in csv file

It often happens that data will be given to you with wrapped columns. Consider, for example:
CCY Decimals CCY Decimals CCY Decimals
AUD/CAD 5 EUR/CZK 4 GBP/NOK 5
AUD/CHF 5 EUR/DKK 5 GBP/NZD 5
AUD/DKK 5 EUR/GBP 5 GBP/PLN 5
AUD/JPY 3 EUR/HKD 5 GBP/SEK 5
AUD/NOK 5 EUR/HUF 3 GBP/SGD 5
...
This should be parsed as a dataframe of two columns (CCY and Decimals), not six. My question is: what is the most idiomatic way of achieving this?
I would have wanted something like the following:
data = pd.read_csv("file.csv")
data.groupby(axis=1, by=data.columns.map(lambda s: s.replace("\..", ""))).\
    apply(lambda df: df.values.flatten())
When reading the csv file we end up with columns CCY, Decimals, CCY.1, Decimals.1, etc. The groupby operation returns a collection of data frames:
<pandas.core.groupby.DataFrameGroupBy object at 0x3a52b10>
We would then flatten these using numpy functionality, converting DataFrames with repeating columns into Series and then merging them into a result DF.
However, this doesn't work. I've tried passing different key arguments to groupby, but it always complains about being unable to reindex non-unique columns.
There are a number of existing questions that deal with flattening groups of columns (e.g. "Flattening" output of group.nth in Pandas), but I can't find any that do this for repeating columns.
To use groupby, I'd do:
>>> groups = df.groupby(axis=1,by=lambda x: x.rsplit(".",1)[0])
>>> pd.DataFrame({k: v.values.flat for k,v in groups})
CCY Decimals
0 AUD/CAD 5
1 EUR/CZK 4
2 GBP/NOK 5
3 AUD/CHF 5
4 EUR/DKK 5
5 GBP/NZD 5
6 AUD/DKK 5
7 EUR/GBP 5
8 GBP/PLN 5
9 AUD/JPY 3
10 EUR/HKD 5
11 GBP/SEK 5
12 AUD/NOK 5
13 EUR/HUF 3
14 GBP/SGD 5
[15 rows x 2 columns]
and then sort.
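For example, assuming the flattened frame from the dict comprehension above is assigned to a variable such as flat (a name introduced here only for illustration), the final sort could look like this:
flat = pd.DataFrame({k: v.values.flat for k, v in groups})
# restore a deterministic order, e.g. alphabetical by currency pair
flat = flat.sort_values('CCY').reset_index(drop=True)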

How to apply a formula for removing data noise in R?

I am working on NGSim traffic data, which has 18 columns and 1180598 rows in a text file. I want to smooth the position data in the column 'Local Y'. I know there are built-in functions for data smoothing in R, but none of them seem to match the formula I am required to apply. The data in the text file looks something like this:
Index VehicleID Total_Frames Local Y
1 2 5 35.381
2 2 5 39.381
3 2 5 43.381
4 2 5 47.38
5 2 5 51.381
6 4 8 504.828
7 4 8 508.325
8 4 8 512.841
9 4 8 516.338
10 4 8 520.854
11 4 8 524.592
12 4 8 528.682
13 4 8 532.901
14 5 7 39.154
15 5 7 43.153
16 5 7 47.154
17 5 7 51.154
18 5 7 55.153
19 5 7 59.154
20 5 7 63.154
The columns above are just an example taken from the original file. Here you can see 3 vehicles, with vehicle IDs 2, 4 and 5, but in fact there are 2169 vehicles with different IDs. The column Total_Frames tells us how many times each vehicle's ID is repeated in the VehicleID column; for example, in the table above, vehicle ID 2 is repeated 5 times, hence '5' in the Total_Frames column. Following is the formula I am required to apply to remove data noise (smoothing) from the column 'Local Y':
Smoothed Position Value(i) =
    [ Summation over k = i-D to i+D of Local Y(k) * exp( -|i-k| / delta ) ]
    / [ Summation over k = i-D to i+D of exp( -|i-k| / delta ) ]
where,
i = index #
delta = 5
D = 15
I have tried using the built-in functions I know of, but they don't smooth the data as required. My question is: is there any built-in function in R which can do the data smoothing according to the given formula, or which could take this formula as an argument? I need to apply the formula to every value in Local Y which has 15 values before it and 15 values after it (i-D and i+D) for the same vehicle ID. Can anyone give me an idea of how to approach the problem? Thanks in advance.
You can place your formula in a function and then use R's apply function to apply it to the elements in the 'Local Y' column of your dataframe.
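The answer above refers to R's apply; for readers following the Python examples elsewhere on this page, a minimal pandas sketch of the same formula, applied separately to each vehicle, might look like the following (df, 'VehicleID' and 'Local Y' are assumed names; rows near the start or end of a trajectory simply use whatever neighbours fall inside the window):
import numpy as np
import pandas as pd

DELTA = 5   # delta in the formula
D = 15      # half-width of the smoothing window

def smooth(y, delta=DELTA, window=D):
    # weighted average of y[k] with weights exp(-|i - k| / delta), k in [i - window, i + window]
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    for i in range(len(y)):
        lo, hi = max(0, i - window), min(len(y), i + window + 1)
        k = np.arange(lo, hi)
        w = np.exp(-np.abs(i - k) / delta)
        out[i] = np.sum(w * y[lo:hi]) / np.sum(w)
    return out

# apply the formula separately to each vehicle's trajectory
df['Smoothed Local Y'] = df.groupby('VehicleID')['Local Y'].transform(smooth)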