I have a long CSV file with "column-name x value" pairs which I would like to read into a pandas.DataFrame:
user_id col val
00008901 1 55
00008901 2 66
00011501 1 77
00011501 3 88
00011501 4 99
The result should look like this:
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
I tried to read it into a list and create a DataFrame from it, but pandas crashed as I have 4.5 million elements.
What's the best way to do that? Ideally directly with read_csv.
First, use read_csv to create the DataFrame:
df = pd.read_csv('file.csv')
Then use set_index with unstack:
df1 = df.set_index(['user_id','col'])['val'].unstack(fill_value=0)
print (df1)
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
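Note that the desired output in the question keeps the leading zeros of user_id (00008901), while the result above shows 8901: read_csv parses that column as an integer by default. A minimal sketch that preserves the zeros, assuming the file is whitespace-separated as shown in the question (drop sep for a true comma-separated file):
import pandas as pd

# read user_id as a string so leading zeros like 00008901 survive
df = pd.read_csv('file.csv', sep=r'\s+', dtype={'user_id': str})
df1 = df.set_index(['user_id', 'col'])['val'].unstack(fill_value=0)
print(df1)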
Another solution uses pivot, replacing NaN with 0 via fillna and finally casting to int:
df1 = df.pivot(index='user_id', columns='col', values='val').fillna(0).astype(int)
print (df1)
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
If you get the error:
"ValueError: Index contains duplicate entries, cannot reshape"
it means you have some duplicates, so the fastest solution is groupby with unstack and some aggregate function like mean or sum:
print (df.groupby(['user_id','col'])['val'].mean().unstack(fill_value=0))
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
This is easier to see with a slightly changed csv:
print (df)
user_id col val
0 8901 1 55
1 8901 2 66
2 11501 1 77 <- duplicate: user_id 11501, col 1
3 11501 1 151 <- duplicate: user_id 11501, col 1
4 11501 3 88
5 11501 4 99
print (df.groupby(['user_id','col'])['val'].mean().unstack(fill_value=0))
col 1 2 3 4
user_id
8901 55 66 0 0
11501 114 0 88 99
Actually I thought I had no duplicates, but found out that I really have some ...
I could not use .mean() as it is a categorical value, but I solved the problem by first looking at the sorted table and then just keeping the last entry ... then applying the (great!) solution ... which I still have to fully understand ;-)
df = df.sort_values(['user_id','col'])  # optional, for debugging
df.drop_duplicates(subset=['user_id','col'], keep='last', inplace=True)
df_table = df.set_index(['user_id','col'])['val'].unstack(fill_value=0)
You can't directly read in the required structure using read_csv, but you can use the pivot_table function to convert to the desired structure.
df = pd.read_csv('filepath/your.csv')
df = pd.pivot_table(df, index='user_id', columns='col', values='val', aggfunc='mean', fill_value=0).reset_index()
The output will be like
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
I don't think it is possible to parse the csv file into this shape with read_csv alone.
You can create a data structure such as a dictionary and use it to create a DataFrame:
import csv
from collections import defaultdict

import pandas as pd

columns = 4          # number of value columns in the output
delimiter = ','
# each user_id maps to a list of `columns` values, initialised to 0
data_dict = defaultdict(lambda: [0] * columns)

with open("my_csv.csv") as csv_file:
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    for row in reader:
        row_id = row["user_id"]
        col = int(row["col"]) - 1   # 'col' is 1-based in the file
        val = int(row["val"])
        data_dict[row_id][col] = val

df = pd.DataFrame(list(data_dict.values()),
                  index=list(data_dict.keys()),
                  columns=range(1, columns + 1))
For a csv file that contains:
user_id,col,val
00008901,1,55
00008901,2,66
00011501,1,77
00011501,3,88
00011501,4,99
The output is:
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
Related
I have a result table like this:
UniqueID  Value
1         100
1         98
1         99
2         100
2         98
3         99
and I want to display the result as follows using SQL:
UniqueID  Value1  Value2  Value3
1         100     98      99
2         100     98
3         99
Please suggest queries using a single table and using multiple tables.
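No SQL answer is attached to this question on this page. Since the rest of the page is pandas-focused, here is a hedged sketch of the same rows-to-columns pivot in pandas; the DataFrame is built by hand from the table above, and the Value1..Value3 column names are an assumption taken from the desired output:
import pandas as pd

# hypothetical frame mirroring the result table above
df = pd.DataFrame({'UniqueID': [1, 1, 1, 2, 2, 3],
                   'Value': [100, 98, 99, 100, 98, 99]})

# number the values within each UniqueID, then pivot them into columns
df['n'] = df.groupby('UniqueID').cumcount() + 1
out = df.pivot(index='UniqueID', columns='n', values='Value')
out.columns = ['Value%d' % c for c in out.columns]
print(out.reset_index())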
I have a data frame with over 800,000 rows and 8 columns that looks like this:
ID Index Var1 Var2 Var3 Var4 Var5 Var6
1 0 106 114 24 25 1 0
1 1 705 79 19 21 1 0
1 2 661 361 30 37 1 0
1 3 212 332 30 37 1 0
I am trying to get this dataframe into a JSON format for a small piece of a larger machine learning project.
I need the json formatted object to look like this:
{'source-ref': 's3://sagemaker...jpg',
 'name': {'annotations': [{'ID': 1,
                           'var1': 106,
                           'var2': 114,
                           'var3': 24,
                           'var4': 25}]}}
I've tried using the to_json function with a combination of the reset_index function on the ID and index. Can someone please point me in the right direction?
800k records will generate a very large JSON object...
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""ID Index Var1 Var2 Var3 Var4 Var5 Var6
1 0 106 114 24 25 1 0
1 1 705 79 19 21 1 0
1 2 661 361 30 37 1 0
1 3 212 332 30 37 1 0"""), sep=r"\s+")

{'source-ref': 's3://sagemaker...jpg',
 'name': {'annotations': df.to_dict(orient="records")}}
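If the goal is exactly the shape shown in the question (only ID plus lowercase var1..var4 inside annotations), a sketch along these lines should get there; the column-to-key mapping below is an assumption based on the question's example:
# keep only the keys from the question's example, renamed to lowercase
cols = {'ID': 'ID', 'Var1': 'var1', 'Var2': 'var2', 'Var3': 'var3', 'Var4': 'var4'}
annotations = df[list(cols)].rename(columns=cols).to_dict(orient='records')
result = {'source-ref': 's3://sagemaker...jpg',
          'name': {'annotations': annotations}}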
I want to read a CSV file in Octave which has a date column and 4 columns which are integers. I have used:
[num,txt,raw] = dlmread('Mitteilungen_data.csv');
ID = num(:,1) ;
DATE = datestr (date, yyyy-mm-dd) ;
FK_OBSERVERS= num(:,2) ;
GROUPS = num(:,3) ;
SUNSPOTS = num(:,4) ;
WOLF = num(:,5) ;
dn=datenum(DATE,'YYYY-MM-DD');
plot(dn,WOLF)
Sample Data:
ID DATE FK_OBSERVERS GROUPS SUNSPOTS WOLF
4939 1612-01-17 11 5 11 61
83855 1612-01-18 85 2 2 22
4940 1612-01-20 11 4 5 45
4941 1612-01-21 11 4 7 47
4942 1612-01-23 11 3 5 35
4943 1612-01-24 11 3 6 36
4944 1612-01-25 11 6 13 73
4945 1612-01-27 11 3 6 36
83856 1612-01-28 85 NULL NULL NULL
4946 1612-01-29 11 3 6 36
4947 1612-01-30 11 4 8 48
4948 1612-02-02 11 5 8 58
4949 1612-02-05 11 4 7 47
4950 1612-02-06 11 3 7 37
4951 1612-02-10 11 5 7 57
4952 1612-02-12 11 3 4 34
4953 1612-02-13 11 2 2 22
4954 1612-02-14 11 3 3 33
The Date column is showing an error: element number 2 undefined in return list. How can I fix this?
You are using
[num, txt, raw] = dlmread( %...
but dlmread does not return three outputs. Type help dlmread in your console to see the syntax.
What does seem to return these three arguments is the xlsread command. Perhaps you copied code that used xlsread?
However, even so, I would still use csv2cell. Type csv2cell('data.csv') (where data.csv is the name of your file) to see what kind of output it gives.
Before you can use any of the commands defined in the io package, you need to load it into your workspace:
pkg load io
I am using SQL Server 2008 R2.
I have a database table that contains some user data, as given below:
Id UserId Sys Dia ReadingType DataId IsDeleted
1 10 98 65 last 1390556024216 0
2 10 99 69 average 1390556024216 0
3 10 102 96 last 1390562788540 0
4 10 102 96 average 1390562788540 0
5 11 130 98 last 1390631241547 0
6 11 130 98 average 1390631241547 0
7 2 285 199 first 1390770562374 0
8 2 250 180 last 1390770562374 0
9 2 267 189 average 1390770562374 0
10 1 258 180 first 1391191009457 0
11 1 258 180 last 1391191009457 0
12 1 258 180 average 1391191009457 0
13 1 285 199 additional 1391191009457 0
14 22 110 78 last 1391549208338 0
15 22 123 83 last 1391549208349 0
In this table, there are records that are having the same DataId but different ReadingType.
I want to set IsDeleted=1 for the records having ReadingType='last' and having a record with ReadingType='average' with the same DataId, Sys, Dia and UserId.
So the desired result should be:
Id UserId Sys Dia ReadingType DataId IsDeleted
1 10 98 65 last 1390556024216 0
2 10 99 69 average 1390556024216 0
3 10 102 96 last 1390562788540 1
4 10 102 96 average 1390562788540 0
5 11 130 98 last 1390631241547 1
6 11 130 98 average 1390631241547 0
7 2 285 199 first 1390770562374 0
8 2 250 180 last 1390770562374 0
9 2 267 189 average 1390770562374 0
10 1 258 180 first 1391191009457 0
11 1 258 180 last 1391191009457 1
12 1 258 180 average 1391191009457 0
13 1 285 199 additional 1391191009457 0
14 22 110 78 last 1391549208338 0
15 22 123 83 last 1391549208349 0
Here the records with Id 3, 5 and 11 should be marked as deleted, as they have ReadingType='last' and share the same UserId, Sys, Dia and DataId with another record that has ReadingType='average'.
Can anyone help me find such records and update them?
Just use UPDATE with an EXISTS subquery:
UPDATE T
SET IsDeleted=1
WHERE
ReadingType='last'
AND
EXISTS(SELECT * FROM T as T1
WHERE T1.ReadingType='average'
AND T1.DataId=T.DataId
AND T1.Sys=T.Sys
AND T1.Dia=T.Dia
AND T1.UserId=T.UserId
)
SQLFiddle demo
You can solve this many ways, but here I am using a sub-query to solve the problem:
UPDATE T SET IsDeleted=1
WHERE ReadingType='last'
AND DataId IN (SELECT DataId FROM T WHERE ReadingType='average')
I would like to compare three columns in two different tables by joining on a unique identifier. But for a single identifier, multiple rows are returned.
Example:
Table A
Identifier Flag1 Flag2 Flag3
1 56 36 46
1 89 65 33
1 56 89 22
1 11 89 65
Table B
Identifier Flag1 Flag2 Flag3
1 56 36 46
1 89 65 33
1 56 89 22
1 10 89 65
Now I would like to compare these two tables based on Identifier 1. Can you please help me out: if all the column values match, I need to update the flag. Thanks in advance.
I think the code below will help; it updates the flag in table A for records whose column values are the same as those in table B:
update A
set someflag = 1  -- 'someflag' is a placeholder for your flag column
where exists
(
    select * from B
    where B.identifier = A.identifier  -- join on the unique identifier
    and B.flag1 = A.flag1
    and B.flag2 = A.flag2
    and B.flag3 = A.flag3
)
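For readers following the rest of this page in pandas, a hedged equivalent of the EXISTS check above, assuming both tables are DataFrames with an identifier column and flag1..flag3 (someflag is again a stand-in name):
import pandas as pd

# hypothetical frames mirroring Table A and Table B
a = pd.DataFrame({'identifier': [1, 1, 1, 1], 'flag1': [56, 89, 56, 11],
                  'flag2': [36, 65, 89, 89], 'flag3': [46, 33, 22, 65]})
b = pd.DataFrame({'identifier': [1, 1, 1, 1], 'flag1': [56, 89, 56, 10],
                  'flag2': [36, 65, 89, 89], 'flag3': [46, 33, 22, 65]})

# flag rows of A that have an exact match in B (same idea as EXISTS)
keys = ['identifier', 'flag1', 'flag2', 'flag3']
merged = a.merge(b[keys].drop_duplicates(), on=keys, how='left', indicator=True)
a['someflag'] = (merged['_merge'] == 'both').astype(int)
print(a)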