`customer_ID` in our system for organizing the customer dataset at the beginning.

## Read dataset
```python
import pandas as pd
import numpy as np
import re

df = pd.read_csv("customer_data.csv")
df
```
| | Customer name | Internal_ID | external_ID |
|---|---|---|---|
0 | Tom | A-11 | A-11 |
1 | Jane | B-12 | B-12 |
2 | Sujie | B-10 | B-10 |
3 | James | A-30 | A-30 |
4 | Sujie | B-10 | B-10 |
... | ... | ... | ... |
92 | Nancy | B-4 | B-4 |
93 | Joseph | A-1 | A-1 |
94 | John | A-5 | A-3 |
95 | Richard | A-3 | A-3 |
96 | Thomas | A-34 | A-34 |
97 rows × 3 columns
Thus, we did two things:

First, we asked a data handling company to classify our customer data based on personal information such as gender and age. They performed the classification and gave us each `Customer name` with an ID. This ID is stored as `external_ID` in our dataset.

Second, we also tried to classify the customer information ourselves - the result is stored as `Internal_ID` in our dataset.

Of course, we would like to confirm whether our classification is right, based on the `external_ID` of each customer. However, it is too tedious to match each `Internal_ID` and `external_ID` one by one.
Hence, we suggest the following accuracy calculation algorithm. We build two matrices - one for `external_ID`, and the other one for `Internal_ID`. Let's start with `external_ID`. Entry $[i,j]$ of the matrix records whether rows $i$ and $j$ share the same `external_ID`: if the second and third rows have the same `external_ID`, put $1$ in the $[2,3]$ element of the matrix; if they have different `external_ID` values, put $0$.

```python
df[2:5]
```
| | Customer name | Internal_ID | external_ID |
|---|---|---|---|
2 | Sujie | B-10 | B-10 |
3 | James | A-30 | A-30 |
4 | Sujie | B-10 | B-10 |
In this slice, the second and third rows have different `external_ID` values, so we put $0$ in the $[2,3]$ element of the matrix. However, the second and fourth rows have the same `external_ID`, so we put $1$ in the $[2,4]$ element.

After building the matrices for `external_ID` and `Internal_ID`, we make another matrix called `Difference`, which is the element-wise difference between them (`Internal_ID` - `external_ID`). Then we can count the correctly classified pairs as the number of $0$ entries in this `Difference` matrix, because a $0$ means our classification and the `external_ID` agree on that pair.
Since this matrix should be symmetric ($[2,4] = [4,2]$), we only need to fill one half-triangle of the matrix; the remaining half-triangle can be left as $0$.
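To make the idea concrete, here is a minimal sketch with three made-up customers (the IDs below are invented for illustration, not taken from the real dataset). It builds both pairwise matrices with NumPy broadcasting instead of the loop-based construction used later, but the entries mean the same thing: $1$ if two rows share an ID, $0$ otherwise.

```python
import numpy as np

# Made-up IDs for three customers (illustration only)
ext = np.array(['B-10', 'A-30', 'B-10'])        # external_ID
internal = np.array(['B-10', 'A-30', 'A-30'])   # our own Internal_ID

# Pairwise "same ID" matrices: entry [i, j] is 1 when rows i and j share an ID
ext_mat = (ext[:, None] == ext[None, :]).astype(int)
int_mat = (internal[:, None] == internal[None, :]).astype(int)

print(ext_mat)
# [[1 0 1]
#  [0 1 0]
#  [1 0 1]]

# Difference matrix: 0 means the two groupings agree on that pair
print(int_mat - ext_mat)
# [[ 0  0 -1]
#  [ 0  0  1]
#  [-1  1  0]]
```

Here the two groupings disagree on two of the three pairs, which show up as the nonzero entries of the difference matrix.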
Some rows have a missing (Null) `external_ID`, so we remove those rows from the dataset.

```python
# Remove the rows whose 'external_ID' is Null
for i in range(len(df['external_ID'])):
    if pd.isna(df['external_ID'][i]):
        df = df.drop(i)

# Reset the index numbering
df2 = df.reset_index()

# Keep only the 'Internal_ID' and 'external_ID' columns
df3 = df2[['Internal_ID', 'external_ID']]
```
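As a side note, the same cleanup could probably be done in one call with pandas' built-in `dropna`; the snippet below is a sketch assuming the same column names, not a drop-in replacement for the loop above:

```python
# Drop rows with a missing external_ID, then reset the index (equivalent intent to the loop + reset above)
df2 = df.dropna(subset=['external_ID']).reset_index()
df3 = df2[['Internal_ID', 'external_ID']]
```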
Next, we sort the dataset by `external_ID` (ascending order) and reset the index numbering.

```python
sort = (df3.sort_values(by = 'external_ID')).reset_index()
sort_external = sort['external_ID']
```
Now we generate the matrices for `external_ID` and `Internal_ID`. In principle, we need to look for pairs of rows that share the same `external_ID` one by one, which would take a long time. Thus, we use a different approach. Because the data is now sorted by `external_ID` (ascending order), rows with the same `external_ID` sit next to each other: starting from each row, we can keep putting $1$ in the matrix until we meet a different `external_ID`, stop there, and repeat the process from the next row. After the `external_ID` matrix, we do the same for `Internal_ID`. The important thing is that we then have to put the rows back into their previous order, so that the subtraction `Internal_ID` - `external_ID` is done in the right order.

```python
# Generate the matrix for `external_ID`
n = sort.shape[0]
external_table = np.zeros((n,n))

# For the first row
for i in range(n):
    if sort_external[i] == sort_external[0]:
        external_table[0,i] = 1
    else:
        break

# For the remaining rows
for i in range(1,n):
    if sort_external[i-1] == sort_external[i]:
        # Same external_ID as the previous row: reuse that row of the matrix
        external_table[i,:] = external_table[i-1,:]
    else:
        # New external_ID: mark the following rows while they share this ID
        for j in range(i,n):
            if sort_external[i] == sort_external[j]:
                external_table[i,j] = 1
            else:
                break

# Save this table as a dataframe
external_ID = pd.DataFrame(external_table)

# Drop the `index` column from the `sort` dataset
sortdrop = sort.drop('index', axis = 1)
```
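As an optional cross-check (not part of the original procedure), the same matrix can be built in a single broadcasting step, because entry $[i,j]$ simply asks whether rows $i$ and $j$ share an `external_ID`:

```python
# One-step construction of the same pairwise matrix via broadcasting
ext_vals = sort_external.to_numpy()
external_check = (ext_vals[:, None] == ext_vals[None, :]).astype(float)
print(np.array_equal(external_check, external_table))  # should print True on the sorted, cleaned data
```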
```python
# Ordering by `Internal_ID` (ascending order) & reset the index numbering
order_id = (sortdrop.sort_values(by = 'Internal_ID')).reset_index()
sort_id = order_id[['Internal_ID']]

# Generate the matrix for `Internal_ID` with the ordered data
n = sort.shape[0]
internal_table = np.zeros((n,n))

# For the first row
for i in range(n):
    if re.match('nan.+', sort_id['Internal_ID'][i]) is not None:
        # Skip rows whose Internal_ID is missing (stored as 'nan...' strings)
        continue
    elif sort_id['Internal_ID'][i] == sort_id['Internal_ID'][0]:
        internal_table[0,i] = 1
    else:
        break

# For the remaining rows
for i in range(1,n):
    if re.match('nan.+', sort_id['Internal_ID'][i]) is not None:
        continue
    elif sort_id['Internal_ID'][i-1] == sort_id['Internal_ID'][i]:
        # Same Internal_ID as the previous row: reuse that row of the matrix
        internal_table[i,:] = internal_table[i-1,:]
    else:
        # New Internal_ID: mark the following rows while they share this ID
        for j in range(i,n):
            if sort_id['Internal_ID'][i] == sort_id['Internal_ID'][j]:
                internal_table[i,j] = 1
            else:
                break
```
```python
# Convert `order_id` to an array to map each row back to its previous row number
# (id_dict maps position in the Internal_ID-sorted data -> position in the external-sorted data)
id_array = np.array(order_id)
id_dict = {}
for i in range(0, len(id_array)):
    id_dict[i] = id_array[i][0]

# Generate the matrix for `Internal_ID` in the previous (external-sorted) order
internal_table_origin = np.zeros((n,n))
for i in range(n):
    for j in range(n):
        internal_table_origin[id_dict[i], id_dict[j]] = internal_table[i,j]

# Save this table as a dataframe
Internal_ID = pd.DataFrame(internal_table_origin)
```
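The same reordering can also be written with NumPy fancy indexing; this is an optional equivalence check on top of the loop above, not a replacement for it:

```python
# `order_id['index']` holds, for each Internal_ID-sorted row, its position in the external-sorted data
perm = order_id['index'].to_numpy()
internal_check = np.zeros((n, n))
internal_check[np.ix_(perm, perm)] = internal_table
print(np.array_equal(internal_check, internal_table_origin))  # should print True
```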
```python
external_ID
```
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
89 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
90 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
92 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
93 rows × 93 columns
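A quick, optional way to confirm the symmetry claim made earlier is to compare the matrix with its transpose (an extra check, not in the original notebook):

```python
# The pairwise matrix should equal its own transpose
print(np.array_equal(external_table, external_table.T))  # should print True
```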
```python
Internal_ID
```
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
89 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
90 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
92 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
93 rows × 93 columns
Finally, we subtract `external_ID` from `Internal_ID` and calculate the accuracy. As noted at the beginning, this matrix should be symmetric, so we do not need to include the other half-triangle in the calculation.

```python
# Generate the matrix of difference between `Internal_ID` and `external_ID`
Difference = Internal_ID - external_ID

# Change the dataframe to an array & count the occurrences of each value
dif_arr = np.array(Difference)
np.unique(dif_arr, return_counts = True)
```
(array([-1., 0.]), array([ 362, 8287]))
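Beyond the raw counts, the disagreeing pairs themselves can be located if needed; this is an optional extra, not part of the original analysis:

```python
# Row/column positions (in the external-sorted order) of pairs where the two groupings disagree
disagree = np.argwhere(dif_arr != 0)
print(disagree[:5])
```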
```python
# Keep only the upper triangle of the Difference matrix (np.triu zeroes out the lower triangle)
new_matrix = np.triu(Difference)
numbers = np.unique(new_matrix, return_counts = True)
numbers
```
(array([-1., 0.]), array([ 226, 8423]))
```python
# Number of off-diagonal elements in one half-triangle
triangle = ((sort.shape[0]*sort.shape[0]) - sort.shape[0]) / 2
# Number of diagonal elements
diag = sort.shape[0]
# The count of 0s includes the zeroed-out lower triangle and the diagonal, so remove both
zero = numbers[1][1] - triangle - diag
accuracy = zero / triangle
print(accuracy)
```
0.9471715755025713
The count of $0$ values (8423) includes the 4278 zeros that `np.triu` wrote into the lower triangle and the 93 diagonal entries, leaving $8423 - 4278 - 93 = 4052$ zeros in the half-triangle, so the accuracy is $\frac{4052}{4278} \approx 94.71\%$.
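For reference, the whole calculation amounts to $\text{accuracy} = \frac{\#\{\text{agreeing pairs}\}}{n(n-1)/2}$, and it can be reproduced end-to-end with a small helper built on broadcasting. The function below is a sketch added for convenience (the name `pair_accuracy` and the vectorized approach are not part of the original notebook); it compares every pair of rows directly instead of building the matrices step by step:

```python
def pair_accuracy(internal_ids, external_ids):
    """Fraction of customer pairs on which the Internal_ID grouping
    agrees with the external_ID grouping (off-diagonal pairs only)."""
    internal_ids = np.asarray(internal_ids)
    external_ids = np.asarray(external_ids)
    same_internal = internal_ids[:, None] == internal_ids[None, :]
    same_external = external_ids[:, None] == external_ids[None, :]
    n = len(external_ids)
    # Only the strict upper triangle is needed, since both matrices are symmetric
    iu = np.triu_indices(n, k=1)
    return np.mean(same_internal[iu] == same_external[iu])

# Hypothetical usage on the cleaned dataset:
# pair_accuracy(df3['Internal_ID'], df3['external_ID'])
```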