Changing columns to rows with specific columns - JSON

I have JSON data that generates this type of table, but I want the data organized under each category name with price ranges.
My required output looks like this table.

Using pd.pivot_table() you can do this:
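For reference, here is a minimal sketch of an input dataframe in the shape this answer assumes (the column names come from the snippet below; the values echo the first output column further down):

import pandas as pd

# hypothetical sample in the assumed input shape
df = pd.DataFrame({
    'date': ['2022-08-22', '2022-08-22'],
    'department': ['CD', 'Other'],
    'category': ['Colleges', 'Colleges'],
    'below_1k': [0, 0],
    'below_10k': [28, 31],
    'below_100k': [77, 85],
    'above_100k': [59, 62],
})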
First we need to build the new 'Total Pages' department and append it to the dataframe:
# sum the price-range columns per category and date
df_tot = df.groupby(['category', 'date'])[['below_1k', 'below_10k', 'below_100k', 'above_100k']].sum().reset_index()
# label the aggregated rows as the 'Total Pages' department
df_tot.loc[:, 'department'] = 'Total Pages'
# append the totals to the original dataframe
df = pd.concat([df, df_tot])
Now we can pivot the dataframe, transpose it, and swap the index levels so that each row is a (category, price range) pair:
df_pivot = df.pivot_table(index=['date', 'department'], columns=['category']).T
df_pivot = df_pivot.swaplevel()
Output:
date 2022-08-22 ... 2022-09-05
department CD Other ... Other Total Pages
category ...
Colleges above_100k 59.0 62.0 ... NaN NaN
Exam above_100k NaN NaN ... 8.0 17.0
StudyAbroad above_100k 1.0 1.0 ... 1.0 2.0
Colleges below_100k 77.0 85.0 ... NaN NaN
Exam below_100k NaN NaN ... 26.0 63.0
StudyAbroad below_100k 4.0 5.0 ... 4.0 8.0
Colleges below_10k 28.0 31.0 ... NaN NaN
Exam below_10k NaN NaN ... 9.0 20.0
StudyAbroad below_10k 28.0 24.0 ... 23.0 47.0
Colleges below_1k 0.0 0.0 ... NaN NaN
Exam below_1k NaN NaN ... 0.0 0.0
StudyAbroad below_1k 26.0 24.0 ... 22.0 43.0
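If you want each category's price-range rows grouped together (as the question asks), a small follow-up sketch:

# sort the row MultiIndex so each category's price ranges sit together
df_pivot = df_pivot.sort_index()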

Convert JSON keys to set in different columns with preset headers in pandas

I want to implement dataframe expansion from JSON data with preset values, collating all the keys within the JSON data to be set as column headers.
id |Name | MonthWise
0 |ABC |{'102022':{'val':100, 'count':1}}
1 |XYZ |{'102022':{'val':20,'count':5},'092022':{'val':20,'count':2}}
2 |DEF |{}
3 |PQR |{'082022':{'val':50,'count':3}}
Here df contains a MonthWise column holding JSON objects, which need to be transposed into 12 separate 'MMYYYY' columns (one year of data).
Something like:
id |Name |042022.val | 042022.count |....|102022.val | 102022.count|....| 032023.val| 032023.count
0 |ABC |nan|nan|....|100|1|....|nan|nan
1 |XYZ |nan|nan|....|20|5|....|nan|nan
2 |DEF |nan|nan|....|nan|nan|....|nan|nan
3 |PQR |nan|nan|....|nan|nan|....|nan|nan
I have tried with df['MonthWise'].apply(pd.json_normalize(x, max_level=1)) but no success.
There is no need for apply in this case. You can use pd.json_normalize directly, as follows:
import pandas as pd
# sample data
df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'Name': ['ABC', 'XYZ', 'DEF', 'PQR'],
    'MonthWise': [
        {'102022': {'val': 100, 'count': 1}},
        {'102022': {'val': 20, 'count': 5}, '092022': {'val': 20, 'count': 2}},
        {},
        {'082022': {'val': 50, 'count': 3}},
    ],
})
# normalize
result = pd.concat([df[['id','Name']], pd.json_normalize(df['MonthWise'])], axis=1)
This returns:
id Name 102022.val 102022.count 092022.val 092022.count 082022.val 082022.count
0 0 ABC 100.0 1.0 NaN NaN NaN NaN
1 1 XYZ 20.0 5.0 20.0 2.0 NaN NaN
2 2 DEF NaN NaN NaN NaN NaN NaN
3 3 PQR NaN NaN NaN NaN 50.0 3.0
(I believe that the expected result in your original post is inconsistent with the input dataframe)
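If you also need the full preset set of twelve 'MMYYYY' column pairs even for months absent from the data, a minimal sketch, assuming the window runs April 2022 through March 2023 as in the expected output:

# hypothetical preset window: April 2022 .. March 2023
months = pd.period_range('2022-04', '2023-03', freq='M')
preset_cols = [f'{m.strftime("%m%Y")}.{f}' for m in months for f in ('val', 'count')]
# reindex so every preset column exists; missing months become NaN
result = result.reindex(columns=['id', 'Name'] + preset_cols)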

Is it possible to create a map with 2 keys and a vector of values in Clojure?

I am trying to create a program that reads a table of temperatures from a CSV file, and I would like to access a collection of temperatures based on the year and day.
The first column is the year the temperatures were recorded.
The second column is a specific day of each month.
The rest of the columns are the temperatures for each month.
For example, 2021 - 23 - 119 means the 23rd of June 2021 had a temperature of 119.
Year Day Months from January to December
2018 18 | 45 54 -11 170 99 166 173 177 175 93 74 69
2021 23 | 13 87 75 85 85 119 190 172 156 104 39 53
2020 23 | 63 86 62 128 131 187 163 162 138 104 60 70
So far I have managed to load the data from a CSV file with clojure.data.csv; this returns a sequence of vectors:
(defn Load_csv_file [filepath]
  (try
    (with-open [reader (io/reader filepath)]
      ;; drop the header row and realize the lazy seq
      ;; before the reader is closed
      (doall (rest (csv/read-csv reader))))
    (catch Exception ex
      (println (str "LOL Exception: " (.toString ex))))))
I am currently trying to figure out how to implement this. My reasoning was to create a map with three keys (the year, the day, and a vector of temperatures) and then filter for a specific value.
Any advice on how I can implement this functionality?
Thanks!
I would go with something like this:
(require '[clojure.java.io :refer [reader]]
         '[clojure.string :refer [split blank?]]
         '[clojure.edn :as edn])

(with-open [r (reader "data.txt")]
  (doall (for [ln (rest (line-seq r))
               :when (not (blank? ln))
               :let [[y d & ms] (mapv edn/read-string (split ln #"\s+\|?\s+"))]]
           {:year y :day d :months (vec ms)})))
;;({:year 2018,
;; :day 18,
;; :months [45 54 -11 170 99 166 173 177 175 93 74 69]}
;; {:year 2021,
;; :day 23,
;; :months [13 87 75 85 85 119 190 172 156 104 39 53]}
;; {:year 2020,
;; :day 23,
;; :months [63 86 62 128 131 187 163 162 138 104 60 70]})
By the way, I'm not sure the CSV format allows mixed separators (as you have in your example), but this approach handles them anyway.
I would create a map of data that looks something like this:
{2020 {23 {:months [63 86 62 128 131 187 163 162 138 104 60 70]}}}
This way you can get the data out in a fairly easy way:
(get-in data [2020 23 :months])
So, something like this:
(->> (Load_csv_file "file.csv")
     (map #(mapv edn/read-string %))  ;; parse the string cells into numbers first
     (reduce (fn [acc [y d & ms]] (assoc-in acc [y d] {:months (vec ms)})) {}))
This will result in the data structure I mentioned; now you just need to figure out the location of the data you want.

Flatten nested JSON columns in Pandas

I'm trying to find an easy way to flatten a nested JSON present in a dataframe column. The dataframe column looks as follows:
stock Name Annual
x Tesla {"0": {"date": "2020","dateFormatted": "2020-12-31","sharesMln": "3856.2405","shares": 3856240500},"1": {"date": "2019","dateFormatted": "2019-12-31","sharesMln": "3856.2405","shares": 3856240500}}
y Google {"0": {"date": "2020","dateFormatted": "2020-12-31","sharesMln": "2526.4506","shares": 2526450600},"1": {"date": "2019","dateFormatted": "2019-12-31","sharesMln": "2526.4506","shares": 2526450600},"2": {"date": "2018","dateFormatted": "2018-12-31","sharesMln": "2578.0992","shares": 2578099200}}
z Big Apple {}
How do I convert the above dataframe to:
Stock Name date dateFormatted sharesMln shares
x Tesla 2020 2020-12-31 3856.2405 3856240500
x Tesla 2019 2019-12-31 3856.2405 3856240500
y Google 2020 2020-12-31 2526.4506 2526450600
y Google 2019 2019-12-31 2526.4506 2526450600
y Google 2018 2018-12-31 2578.0992 2578099200
z Big Apple None None None None
I've tried using pd.json_normalize(dataframe['Annual'], max_level=1) but I'm struggling to get the desired result shown above.
Any pointers will be appreciated.
Get the values from the dicts and transform each element into a row with explode (the original index is duplicated). Then expand the nested dicts (the values of your outer dict) into columns. Finally, join your original dataframe with the new one.
>>> df
stock Name Annual
0 x Tesla {'0': {'date': '2020', 'dateFormatted': '2020-...
1 y Google {'0': {'date': '2020', 'dateFormatted': '2020-...
2 z Big Apple {}
data = df['Annual'].apply(lambda x: x.values()) \
                   .explode() \
                   .apply(pd.Series)
df = df.join(data).drop(columns='Annual')
Output result:
>>> df
stock Name date dateFormatted sharesMln shares
0 x Tesla 2020 2020-12-31 3856.2405 3.856240e+09
0 x Tesla 2019 2019-12-31 3856.2405 3.856240e+09
1 y Google 2020 2020-12-31 2526.4506 2.526451e+09
1 y Google 2019 2019-12-31 2526.4506 2.526451e+09
1 y Google 2018 2018-12-31 2578.0992 2.578099e+09
2 z Big Apple NaN NaN NaN NaN
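If the duplicated index and the scientific-notation floats bother you, a small follow-up sketch: shares became float because of the NaN row, and pandas' nullable Int64 dtype can restore integers.

# optional cleanup: fresh index and nullable integers for 'shares'
df = df.reset_index(drop=True)
df['shares'] = df['shares'].astype('Int64')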

Custom datetime parsing to combine date and time after reading CSV - Pandas

Upon reading a text file, I am presented with an odd format where the date and time are contained in separate columns, as follows (the file is tab-separated).
temp
room 1
Date Time simulation
Fri, 01/Jan 00:30 11.94
01:30 12
02:30 12.04
03:30 12.06
04:30 12.08
05:30 12.09
06:30 11.99
07:30 12.01
08:30 12.29
09:30 12.46
10:30 12.35
11:30 12.25
12:30 12.19
13:30 12.12
14:30 12.04
15:30 11.96
16:30 11.9
17:30 11.92
18:30 11.87
19:30 11.79
20:30 12
21:30 12.16
22:30 12.27
23:30 12.3
Sat, 02/Jan 00:30 12.25
01:30 12.19
02:30 12.14
03:30 12.11
etc.
I would like to:
parse date and time over two columns ([0], [1]);
shift all timestamps 30 minutes earlier, i.e. replace :30 with :00.
I have used the following code:
timeparse = lambda x: pd.datetime.strptime(x.replace(':30', ':00'), '%H:%M')
df = pd.read_csv('Chart_1.txt',
                 sep='\t',
                 skiprows=1,
                 date_parser=timeparse,
                 parse_dates=['Time'],
                 header=1)
This does parse the time but not the dates (obviously, as this is what I told it to do).
Also, skipping rows is useful for finding the Date and Time headers, but it discards the temp and room 1 headers, which I need.
You can use:
import pandas as pd
df = pd.read_csv('Chart_1.txt', sep='\t')
# get the temperature header from the third column
temp = df.columns[2]
print (temp)
Dry resultant temperature (°C)
# get the file name from the second row, third column
aps = df.iloc[1, 2]
print (aps)
AE4854c_Campshill_openings reduced_communal areas increased openings2.aps
# create a mask from the first column - values containing / are dates
mask = df.iloc[:, 0].str.contains('/', na=False)
# shift all rows that do NOT contain dates one column to the right
df1 = df[~mask].shift(1, axis=1)
# get the rows with dates
df2 = df[mask]
# concat df1 and df2, then sort the now-unsorted index
df = pd.concat([df1, df2]).sort_index()
# create new column names: the first 3 are custom,
# the rest come from the first row, fourth column onward
df.columns = ['date', 'time', 'no name'] + df.iloc[0, 3:].tolist()
# remove the first 2 rows
df = df[2:]
# fill NaN values in the date column by forward filling
df.date = df.date.ffill()
# convert the date column to datetime
df.date = pd.to_datetime(df.date, format='%a, %d/%b')
# replace :30 with :00
df.time = df.time.str.replace(':30', ':00')
print (df.head())
date time no name 3F_T09_SE_SW_Bed1 GF_office_S GF_office_W_tea \
2 1900-01-01 00:00 11.94 11.47 14.72 16.66
3 1900-01-01 01:00 12.00 11.63 14.83 16.69
4 1900-01-01 02:00 12.04 11.73 14.85 16.68
5 1900-01-01 03:00 12.06 11.80 14.83 16.65
6 1900-01-01 04:00 12.08 11.84 14.79 16.62
GF_Act.Room GF_Communal areas GF_Reception GF_Ent Lobby ... \
2 17.41 12.74 12.93 10.85 ...
3 17.45 12.74 13.14 11.00 ...
4 17.44 12.71 13.23 11.09 ...
5 17.41 12.68 13.27 11.16 ...
6 17.36 12.65 13.28 11.21 ...
2F_S01_SE_SW_Bedroom 2F_S01_SE_SW_Int Circ 2F_S01_SE_SW_Storage_int circ \
2 12.58 12.17 12.54
3 12.64 12.22 12.49
4 12.68 12.27 12.48
5 12.70 12.30 12.49
6 12.71 12.31 12.51
GF_G01_SE_SW_Bedroom GF_G01_SE_SW_Storage_Bed 3F_T09_SE_SW_Bathroom \
2 14.51 14.61 11.49
3 14.55 14.59 11.50
4 14.56 14.59 11.52
5 14.55 14.58 11.54
6 14.54 14.57 11.56
3F_T09_SE_SW_Circ 3F_T09_SE_SW_Storage_int circ GF_Lounge GF_Cafe
2 11.52 11.38 12.83 12.86
3 11.56 11.35 13.03 13.03
4 11.61 11.36 13.13 13.13
5 11.65 11.39 13.17 13.17
6 11.68 11.42 13.18 13.18
[5 rows x 31 columns]
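If you then want a single timestamp column combining the parsed date and the shifted time, a minimal sketch building on the df above (the year defaults to 1900 because the source file carries no year):

# combine the parsed date and adjusted time into one datetime column
df['datetime'] = pd.to_datetime(
    df['date'].dt.strftime('%Y-%m-%d') + ' ' + df['time'],
    format='%Y-%m-%d %H:%M')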

MATLAB's nanmean() function not working with dimensions other than 1

Take this example from the MathWorks help for nanmean():
X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
>> y = nanmean(X,2)
??? Error using ==> nanmean
Too many input arguments.
Why is it showing an error even though the docs say the mean can be taken along any dimension dim of X as y = nanmean(X,dim)? Thanks.
I ran exactly the code you have and I get no error. In particular, here is what I ran:
>> X = magic(3);
X([1 6:9]) = repmat(NaN,1,5)
X =
NaN 1 NaN
3 5 NaN
4 NaN NaN
>> y = nanmean(X,2)
y =
1
4
4
>> which nanmean
C:\Program Files\MATLAB\R2010b\toolbox\stats\stats\nanmean.m
The only thing I can think of is that you have a different version of nanmean.m on your path. Run which nanmean and see if it points into the stats toolbox.
Here is the reason:
If X contains a vector of all NaN values along some dimension, the vector is empty once the NaN values are removed, so the sum of the remaining elements is 0. Since the mean involves division by 0, its value is NaN. The output NaN is not a mean of NaN values.
Look at:
http://www.mathworks.com/help/toolbox/stats/nanmean.html