use adding duplicate() boolean list as a new column to check duplicated rows not working - duplicates

I am a new learner and doing an assignment about identify and remove duplicates from a frame.
so I first finished my assignments by using drop_duplicates() method, and checked the df.shape before and afterwards. the rows changed from 11552 to 11398, which indicates there were duplicated rows removed.
code for identify and removing duplicates
I got curious and wanted to take a look at which rows were duplicated. and this is what I did:
I went back, read the source into dataframe df again. then used duplicated() method to get a boolean list, added this list as a new column 'duplicated' to df2=dfcopy(), and tried to slice the rows got a "True" for column 'duplicated', which should give me a table contains all the duplicated rows.
code for what I did to check duplicates in the original data and got 0 result
this returns me nothing, 0 rows. however, the fact that there were rows removed when I ran drop_duplicates() evidently shows there are duplicates in the original data?? where went wrong?
Below are the exact code I ran:
1, I first read source into the frame and removed duplicates:
df = pd.read_csv("https://cf-course_data.csv") #read data into frame df
print(df.shape) #checked there's 11552 rows
df=df.drop_duplicates() #removed duplicates
df.shape #checked shape again and now there's 11398 row
2, then I wanted to see what exactly the duplicates are:
df = pd.read_csv("https://cf-course_data.csv") #read data into a frame df again
df2 = df.copy() #created a dummy
df2['duplicated']=df.duplicated() #add duplicated() boolean list as a new column
df2.loc[df2['duplicated']=='True'] #slice the duplicates yet got a result 0

Related

Mysql removing x value form an Json column

I'm trying to update my column because I'm working on a "Kick from Team" function. I've tried different solutions that I've found on google.
My first attempt was this:
UPDATE table SET aMembers = JSON_REMOVE(aMembers, '$[1]') WHERE id = 1
aMembers looks like (Column type: JSON) :
[1, 2, 8, 99, 12, 233, 819]
That works somewhat. It'll remove the given index from aMembers. This is not what I'm after tho. I'm after something that'll remove value 1 from aMembers.
Alright, then I tried this one:
UPDATE table SET aMembers = JSON_REMOVE(aMembers, JSON_UNQUOTE(JSON_SEARCH(aMembers, 'one', '1'))) WHERE id = 1
This sets my whole column as NULL which is also not what I'm looking for. Am I doing this wrong or is this just not possible? Is there a query that'll remove id 1 from my column or am I forced to
1. With js - Get column aMembers
2. Find out at which index ID 1 is at
3. Create a new query that'll remove index X
For DB I am using MariaDB.
For my frontend I am using NextJS.
Backend is NodeJS.
I chose to solve this by using index. Each time I print the members out they will be printed in the same positions as in the JSON column which means that I can simply use index to remove x member from the column.
So each child, when printed out, get's an Index prop and this prop is being set to my api that kicks them. This then updates the JSON in mysql table row to remove index x from the column.

Pulling One Element From A CSV File

I'm trying to write a function that will return the most recent 'closing' value in a csv file containing the data of a cryptocurrency. The csv file contains 6 columns and about 900 rows and I'm looking to only pull one element of the table.
However, I seem to faced a fair bit of difficulty in pulling this off for some reason. The function below returns values from the column I want, however it seems to be pulling values from the very bottom of the document (whereas I want the most recent values).
Also, just a side note to explain what I was attempting to do with the 'count'. Since I'm expecting the value I want to be located on the second row, I wanted my for loop to only iterate through two lines of the file. However, as the result of the function went on to reveal to me, as it currently stands with the counter I'm returning two values from the function.
I understand there must be a much less convoluted way of getting the information I need so am open to any solution to the problem. Though, that being said, I'd be really interested to see where I went wrong here as I'm fairly new to Python.
Thanks a lot!
def csv_to_close(csv_file):
with open(f"{csv_file}.csv", 'r') as csvfile:
csv_file = csv.reader(csvfile)
running = True
count = 0
while running == True:
if count < 2:
for column in csv_file:
close = column[4]
count += 1
else:
running = False
print(close)

Dataframe is of type 'nonetype'. How should I alter this to allow merge function to operate?

I have pulled in data from a number of csv files, as well as a database. I wish to use a merge function to make a dataframe isolating the phone numbers that are contained in both dataframes(one originating from csv, the other originating from the database). However, the dataframe from the database displays as type 'nonetype.' This disallows any operation such as merge. How can i change this to allow the operation?
The data comes in from the database as a list of tuples. I then convert this to a dataframe. However, as stated above, it displays as 'nonetype.' I'm assuming at the moment I am confused about about how dataframes handle data types.
#Grab Data
mycursor = mydb.cursor()
mycursor.execute("SELECT DISTINCT(Cell) FROM crm_data.ap_clients Order By Cell asc;")
apclients = mycursor.fetchall()
#Clean Phone Number Data
for index, row in data.iterrows():
data['phone_number'][index] = data['phone_number'][index][-10:]
for index, row in data2.iterrows():
data2['phone_number'][index] = data2['phone_number'][index][-10:]
for index, row in data3.iterrows():
data3['phone_number'][index] = data3['phone_number'][index][-10:]
#make data frame from csv files
fbl = pd.concat([data,data2,data3], axis=0, sort=False)
#make data frame from apclients(database extraction)
apc = pd.DataFrame(apclients)
#perfrom merge finding all records in both frames
successfulleads= pd.merge(fbl, apc, left_on ='phone_number', right_on='0')
#type(apc) returns NoneType
The expected results are to find all records in both dataframes, along with a count so that I may compare the two sets. Any help is greatly appreciated from this great community :)
So it looks like I had a function to rename the column of the dataframe as shown below:
apc = apc.rename(columns={'0': 'phone_number'}, inplace=True)
for col in apc.columns:
print(col)
the code snippet out of the above responsible:
inplace=True
This snippet dictates whether or not the object is modified in the dataframe, or whether a copy is made. The return type on said object is of nonetype.
Hope this helps whoever ends up in my position. A great thanks again to the community. :)

Create a node for each column only once while importing csv into Neo4j

I have a csv file that looks the following way:
I want to create a database from it in Neo4j. Rows are nodes with labels gene, columns are also nodes with labels cell. I need to write a CREATE query that would create all my gene and cell - nodes and a relationship one for each combination of gene and cell. Currently I am stuck with the following code:
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
CREATE (:Gene {id: line.gene_ids, name: line.wikigene_name})
I need to somehow iterate over all columns - starting from index 3 - after creating gene nodes, but I do not know how to do that.
Here are 3 queries that, performed in order, should do what you want.
This query creates a temporary Headers node with a names property that contains the collection of headers from the CSV file. It uses LIMIT 1 to only process the first row of the file. It also creates all the Cell nodes, each with it own name property.
LOAD CSV FROM 'file:///merged_full.csv' AS line
MERGE (h:Headers)
SET h.names = line
WITH line
LIMIT 1
UNWIND line[3..] AS name
MERGE (c:Cell {name: name})
This query uses the APOC function apoc.map.fromNodes to generate a map named cells, which maps each cell name to its cell node. It also gets the Headers node. It then loads the non-header data from the CSV file (using SKIP 1 to skip over the header row), and processes each row as follows. It uses MERGE to get/create a Gene node, g, with the desired id and name. It uses the REDUCE function to generate a collection of the Cell nodes that have a "1" column value in the current row, and the FOREACH clause then creates a (g)-[:HAS]->(x) relationship (if necessary) for every cell, x, in that collection.
WITH apoc.map.fromNodes('Cell', 'name') AS cells
MATCH (h:Headers)
LOAD CSV FROM 'file:///merged_full.csv' AS line
WITH h, cells, line
SKIP 1
MERGE (g:Gene {id: line[1], name: line[2]})
FOREACH(
x IN REDUCE(s = [], i IN RANGE(3, SIZE(line)-1) |
CASE line[i] WHEN "1" THEN s + cells[h.names[i]] ELSE s END) |
MERGE (g)-[:HAS]->(x))
This query just deletes the temporary Headers node (if you wish):
MATCH (h:Headers)
DELETE h;
If the columns correspond with cell nodes, then you should know all the cell nodes you need just be looking at the CSV header.
I'd recommend writing a small query just to create each of the cell nodes you need, then create an index or unique constraint on :Cell(id) (or name, or whatever the property is that is meant to identify a :Cell).
At that point the problem becomes getting and processing each relevant column (I assume only the ones with 1 as the value). APOC Procedures may help here.
apoc.map.sortedProperties() can be used to take your line map and give you a list of key/value list pairs, which you can filter down to those where the key begins with 'V', and where the value is 1, then use what's remaining to match on the relevant :Cell node and create the relationship.

how to retrieve data from a column and produce that value into a new column

I'm super new to R and have a question on how to do something. I listed the things that i got to work so ppl have an idea on what is going on. the thing im having trouble with is in bold.
-I have a data spreadsheet with 2 columns of data (End and CTCF). With the CTCF column having more cells
-I want to to take one value from the "End" column and subtract that value from each individual value in "CTCF" column (so i would have a bunch of products from each calculation)
-I want to then compare those products and find the miniumn absoulute value and the coresponding spot in the CTCF column
-then place that value into a new column ajacent to the corresponding End value.
I wrote a while loop (i know there is probably a WAY easier method) and got the calulation/comparison thing down. I was even able to output the location of the CTCF cell that contains my value of interest see below:
*
*data2<-read.csv("farah.csv")
head(data2)
periph_ctcfs<-list()
temp<-vector(mode="numeric", length = 356)
count<-1
for(i in 1:length(data2$CTCF)) while (count<357)
{
End<-data2$End[count]
periph_ctcfs<-(End-data2$CTCF)
periph_ctcfs<-abs(periph_ctcfs)
periph_ctcfs<-which.min(periph_ctcfs)
print(periph_ctcfs)
temp[]<-data2$CTCF[periph_ctcfs]
count<-count + 1;
}*
The problem is when im trying to produce the new "periph_ctcfs" column, when im trying to insert it into the "temp" vector, the last printed number gets placed within all the cells of the "temp" vector. It feels like that each time the loop goes through its not inserting the retrieved value into "temp". Can anyone help? Thanks ive included a link to a photo (below) so you can get a visual on the layout of the data. Sorry for being a n00b.
For clarity purposes: