customer_ID in our system for organizing customer dataset at the beginning. ## Read dataset
import pandas as pd
import numpy as np
import re
df = pd.read_csv("customer_data.csv")
df
| Customer name | Internal_ID | external_ID | |
|---|---|---|---|
| 0 | Tom | A-11 | A-11 |
| 1 | Jane | B-12 | B-12 |
| 2 | Sujie | B-10 | B-10 |
| 3 | James | A-30 | A-30 |
| 4 | Sujie | B-10 | B-10 |
| ... | ... | ... | ... |
| 92 | Nancy | B-4 | B-4 |
| 93 | Joseph | A-1 | A-1 |
| 94 | John | A-5 | A-3 |
| 95 | Richard | A-3 | A-3 |
| 96 | Thomas | A-34 | A-34 |
97 rows × 3 columns
Thus, we did two things:
We requested to data handling company to classify our customer data based on their personal information, such as gender, age. They performed classification, and gave us Customer name and their ID. This ID will be replaced as external_ID in our dataset.
We also tried to classify customer information ourselves - which will be showed as Internal_ID in our dataset.
Of course, we would like to confirm whether our classification is right or not, based on external_ID for each customer. However, it's too tedius to match each Internal_ID and external_ID one by one.
Hence, we have suggested Accuracy calculation algorithm for this situation.
external_ID, and the other one for Internal_ID.external_ID. If second row and third row has same external_ID, put $0$ in the [2,3]th element of the matrix. If second row and third row has differnet external_ID, put $1$.df[2:5]
| Customer name | Internal_ID | external_ID | |
|---|---|---|---|
| 2 | Sujie | B-10 | B-10 |
| 3 | James | A-30 | A-30 |
| 4 | Sujie | B-10 | B-10 |
external_ID, put $1$ to [2,3]th element of matrix. However, second and fourth row has same external_ID, so put $0$ to [2,4]th element.external_ID and Internal_ID, make another matrix as Difference which is difference between external_ID and Internal_ID. (external_ID - Internal_ID)Then, we can confirm the correctly classified data as the number of $0$ of this Difference matrix - cause it means our classification and external_ID is same.
Since this matrix should be symmetric $([2,4] = [4,2])$, we need to fill the half-triangle part of the matrix - we can leave remaining half-triangle part as $0$.
external_ID, so remove those data from the dataset.# Remove the rows which 'external_ID' is Null
for i in range(len(df['external_ID'])):
if pd.isna(df['external_ID'][i]) == True:
df = df.drop(i)
# Reset the index numbering
df2 = df.reset_index()
# Make the dataset only including 'Internal_ID' and 'external_ID'
df3 = df2[['Internal_ID', 'external_ID']]
external_ID (ascending order) and reset the index numbering sort = (df3.sort_values(by = 'external_ID')).reset_index()
sort_external = sort['external_ID']
external_ID and Internal_ID. As originally, we need to look for paired data which have same external_ID one-by-one. It seems take a lot of time to check one-by-one. Thus, we'll use some different way as below:
external_ID (ascending order). We can pass data which has same external_ID until we find different external_ID. Then we stop to put $1$ in the matrix, and keep doing this process again.external_ID matrix, do same process by Internal_ID. The important thing is, we have to return the dataframe as the original order at this time since we have to do external_ID - Internal_ID in the right order.# Generate the matrix for `external_ID`
n = sort.shape[0]
external_table = np.zeros((n,n))
# For the first row
for i in range(n):
if sort_external[i] == sort_external[0]:
external_table[0,i] = 1
else:
break
# For the remaining rows
for i in range(1,n):
if sort_external[i-1] == sort_external[i]:
external_table[i,:] = external_table[i-1,:]
else:
for j in range(i,n):
if sort_external[i] == sort_external[j]:
external_table[i,j] = 1
else:
break
# Save this table as dataframe
external_ID = pd.DataFrame(external_table)
# Drop the `index` column from the `sort` dataset
sortdrop = sort.drop('index', axis = 1)
# Ordering by `Internal_ID` (ascending order) & reset the index numbering
order_id = (sortdrop.sort_values(by = 'Internal_ID')).reset_index()
sort_id = order_id[['Internal_ID']]
# Generate the matrix for `ID` with ordered data
n = sort.shape[0]
internal_table = np.zeros((n,n))
# For the first row
for i in range(n):
if pd.isna(re.match('nan.+', sort_id['Internal_ID'][i])) is False:
continue
elif sort_id['Internal_ID'][i] == sort_id['Internal_ID'][0]:
internal_table[0,i] = 1
else:
break
# For the remaining rows
for i in range(1,n):
if pd.isna(re.match('nan.+', sort_id['Internal_ID'][i])) is False:
continue
elif sort_id['Internal_ID'][i-1] == sort_id['Internal_ID'][i]:
internal_table[0,i] = 1
else:
break
# Convert `order_ID` to array to match their original row numbers
# listing their original row numbers with their keys
# ex) {0:167, 1:474} -> {167, 474}
id_array = np.array(order_id)
id_dict = {}
for i in range(0, len(id_array)):
id_dict[i] = id_array[i][0]
# Generate the matrix for `Internal_ID` (original)
internal_table_origin = np.zeros((n,n))
for i in range(n):
for j in range(n):
internal_table_origin[id_dict[i], id_dict[j]] = internal_table[i,j]
# Save this table as dataframe
Internal_ID = pd.DataFrame(internal_table_origin)
external_ID
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 89 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 90 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 92 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
93 rows × 93 columns
Internal_ID
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 89 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 90 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 92 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
93 rows × 93 columns
external_ID from Internal_ID, and calculate the Accuracy. As I told at the beginning, this matrix should be symmetric. Thus, we don't need to include the half-triangle part in the calculation.# Generate the matrix of difference between `Internal_ID` and `external_ID`
Difference = Internal_ID - external_ID
# Change dataframe to array & count the number of each value
dif_arr = np.array(Difference)
np.unique(dif_arr, return_counts = True)
(array([-1., 0.]), array([ 362, 8287]))
new_matrix = np.triu(Difference)
numbers = np.unique(new_matrix, return_counts = True)
numbers
(array([-1., 0.]), array([ 226, 8423]))
triangle = ((sort.shape[0]*sort.shape[0]) - sort.shape[0]) / 2
diag = sort.shape[0]
zero = numbers[1][1] - triangle - diag
accuracy = zero / triangle
print(accuracy)
0.9471715755025713
zero(0) is 4052 in the half-triangle, so the accuracy is $\frac{4052}{4278} = 94.71$%.