I would like to know how to pull out the duplicate information - csv

So I am new to pandas and Python. Currently, I am tasked with identifying which IDs in the "id" column are duplicates. For example, if ID 413 appears more than once, it is considered a duplicate. Since there are more than 600,000 IDs, I need to know the code to do it. Please help!

You can use duplicated, which returns a boolean Series you can use to mask the df, and then call unique to return an array of the duplicated IDs:
In [196]:
import pandas as pd
df = pd.DataFrame({'ID': [0, 1, 1, 3, 4, 5, 6, 6, 6]})
df
Out[196]:
   ID
0   0
1   1
2   1
3   3
4   4
5   5
6   6
7   6
8   6
In [201]:
df[df['ID'].duplicated()]['ID'].unique()
Out[201]:
array([1, 6], dtype=int64)
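Applied to a CSV like yours, a minimal sketch (the filename and the lowercase "id" column are assumptions taken from the question):

import pandas as pd

# "data.csv" is a hypothetical filename; the question names the column "id"
df = pd.read_csv("data.csv")

dup_ids = df.loc[df['id'].duplicated(), 'id'].unique()  # IDs appearing more than once
print(dup_ids)

# keep=False marks every occurrence, so this shows all rows of duplicated IDs
print(df[df['id'].duplicated(keep=False)])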

Related

Shared foreign keys without duplication of entries?

Sorry for the beginner question.
I have an Outputs table:
ID  value
0   x
1   y
2   z
And an Inputs table that is linked to the Outputs through the outputsID:
ID  outputsID  name
0   0          A
1   1          B
2   1          C
3   2          B
4   2          C
Assuming that multiple outputs share at least one input (in this example input rows 1, 3 and 2, 4 are the same), is there a way to avoid the duplication of entries in my Inputs table (input IDs 3 and 4)?
The 'normal' answer to your question is no. Rows 1 and 2 address output 1, and rows 3 and 4 address output 2. They aren't duplicates; each reflects something distinct.
So if you are a beginner, I would say you shouldn't want to get rid of these rows.
That said, there are some more advanced techniques. For example, you could have the outputsID column be an array with multiple values, as sketched below. This is harder, more complex, and non-standard.
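A minimal in-memory sketch of that array idea in plain Python, with the data from the question (a real database would need an array-capable column type):

# each distinct input is stored once; outputsID holds every output it feeds
inputs = [
    {"ID": 0, "outputsID": [0],    "name": "A"},
    {"ID": 1, "outputsID": [1, 2], "name": "B"},  # former rows 1 and 3 merged
    {"ID": 2, "outputsID": [1, 2], "name": "C"},  # former rows 2 and 4 merged
]

# look up the inputs of output 1
print([row["name"] for row in inputs if 1 in row["outputsID"]])  # ['B', 'C']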

Generate numbers based on a table column in MySQL

I have a table with the following rows:
ID  Description  Number
1   Test 1       4
2   Test 2       3
3   Test 3       5
4   Test 5       6
How do I create my query so that if I want ID 3, it generates the following based on the Number column:
Count
1
2
3
4
5
Thanks. :)
It looks like your rows are already uniquely identified. You need to query the row with ID 3 and then perform an operation in PHP to count out to the end of the number set, 5 in this case. You could also use arrays and just loop through the array for each number.
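The answer suggests PHP; here is the same idea sketched in Python instead, with the standard-library sqlite3 standing in for MySQL (table and column names are taken from the question):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ID INTEGER, Description TEXT, Number INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, "Test 1", 4), (2, "Test 2", 3),
                  (3, "Test 3", 5), (4, "Test 5", 6)])

# query the Number for ID 3, then count out to the end of the number set
(n,) = conn.execute("SELECT Number FROM t WHERE ID = 3").fetchone()
for count in range(1, n + 1):
    print(count)  # prints 1 through 5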

Stop duplicated indexing

I am trying to stop duplicate entries in my database (below), e.g. it should come up with an error message if the VechID, Collection date, and Return date are the same. I am opening my table in Design View, clicking Indexes, and then indexing the relevant fields, but it won't let me and keeps saying no due to duplicate values. Is this the correct method?
Booking ID  VechID  CuID  Collection date  Return date
1           3       7     01/07/2017       10/07/2018
2           1       7     23/04/2017       16/05/2018
3           2       1     17/05/2017       28/05/2018
4           4       2     15/05/2017       20/05/2018
5           5       2     01/06/2017       24/06/2018
6           6       2     22/07/2017       29/08/2018
7           4       8     01/07/2017       15/07/2018
8           8       8     01/08/2017       20/08/2018
9           8       2     21/01/2017       20/01/2018
10          4       8     25/09/2017       02/10/2018
13          8       8     25/09/2017       02/10/2018
Yes, you need to create a unique index on the fields (VechID, Collection date, Return date).
Of course, you can't do that if you already have data in your table that violates this unique index.
Use the Find Duplicates Query Wizard to find and delete them.
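If you export the table first, the pandas trick from the first answer above also finds the violating rows; a sketch, assuming a hypothetical export file and that the CSV headers match the table:

import pandas as pd

df = pd.read_csv("bookings.csv")  # hypothetical export of the Access table
key = ["VechID", "Collection date", "Return date"]

# keep=False flags every row involved in a collision on the would-be unique index
print(df[df.duplicated(subset=key, keep=False)])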

Pandas: flattening repeating/wrapped columns in csv file

It often happens that data will be given to you with wrapped columns. Consider, for example:
CCY Decimals CCY Decimals CCY Decimals
AUD/CAD 5 EUR/CZK 4 GBP/NOK 5
AUD/CHF 5 EUR/DKK 5 GBP/NZD 5
AUD/DKK 5 EUR/GBP 5 GBP/PLN 5
AUD/JPY 3 EUR/HKD 5 GBP/SEK 5
AUD/NOK 5 EUR/HUF 3 GBP/SGD 5
...
Which should be parsed as a dataframe of two columns (CCY and Decimals), not six. My question is, what is the most idiomatic way of achieving this?
I would have wanted something like the following:
data = pd.read_csv("file.csv")
data.groupby(axis=1,by=data.columns.map(lambda s: s.replace("\..",""))).\
apply(lambda df : df.values.flatten())
When reading the csv file we end up with columns CCY, Decimals, CCY.1, Decimals.1, etc. The groupby operation returns a collection of data frames:
<pandas.core.groupby.DataFrameGroupBy object at 0x3a52b10>
which we would then flatten using numpy functionality. So we would be converting DataFrames with repeating columns into Series, and then merging these into a result DF.
However, this doesn't work. I've tried passing different keys arguments to groupby, but it always complains about being unable to reindex non-unique columns.
There are a number of existing questions that deal with flattening groups of columns (e.g. "Flattening" output of group.nth in Pandas), but I can't find any that do this for repeating columns.
To use groupby, I'd do:
>>> groups = df.groupby(axis=1,by=lambda x: x.rsplit(".",1)[0])
>>> pd.DataFrame({k: v.values.flat for k,v in groups})
CCY Decimals
0 AUD/CAD 5
1 EUR/CZK 4
2 GBP/NOK 5
3 AUD/CHF 5
4 EUR/DKK 5
5 GBP/NZD 5
6 AUD/DKK 5
7 EUR/GBP 5
8 GBP/PLN 5
9 AUD/JPY 3
10 EUR/HKD 5
11 GBP/SEK 5
12 AUD/NOK 5
13 EUR/HUF 3
14 GBP/SGD 5
[15 rows x 2 columns]
and then sort, if the original ordering matters.
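On newer pandas, where column-axis groupby is deprecated, the same flattening can be sketched by slicing the column groups and concatenating them (column names and group width are assumed from the example above):

import pandas as pd

df = pd.read_csv("file.csv")  # columns CCY, Decimals, CCY.1, Decimals.1, ...
width = 2  # columns per repeated group

# slice each (CCY, Decimals) block, relabel it with the first block's names, stack
blocks = [df.iloc[:, i:i + width].set_axis(df.columns[:width], axis=1)
          for i in range(0, df.shape[1], width)]
flat = pd.concat(blocks, ignore_index=True).sort_values("CCY")
print(flat)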

Is this pattern reconstitution or what is the name for this problem?

I have the following problem and don't know the terminology to describe it, hence my search for possible solutions.
I have a pivot table (matrix): each row and column has a named header, and there is a defined set of rows and columns. Now let's assume that 10 rows are "combined", meaning each column is summed up to create a new "pattern".
What I would like is a way to determine alternative row combinations that lead to the same or a similar "combined" pattern.
1 1 1
5 5 5
"Combined"
6 6 6
alternate row combination:
2 2 2
4 4 4
Suggestions? What is this problem called?
http://en.wikipedia.org/wiki/System_of_linear_equations#Matrix_equation
I just have to transpose the above matrix to get the matrix A:
[code]
1 5
1 5
1 5
[/code]
and the combined pattern is the vector b:
[code]
6
6
6
[/code]
and x would just be a vector of ones, so that A·x = b.
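A small numpy sketch of that matrix-equation framing; the rank deficiency is what makes alternate combinations possible:

[code]
import numpy as np

# columns of A are the original rows, transposed as above
A = np.array([[1, 5],
              [1, 5],
              [1, 5]])
b = np.array([6, 6, 6])  # the "combined" pattern

x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(x, rank)  # minimum-norm solution; rank 1 < 2 unknowns

# x = [1, 1] (use each row once) also satisfies A @ x = b; because the
# system is underdetermined, other weightings reproduce the same pattern,
# which is exactly the "alternate row combination" asked about.
[/code]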