Predicting the number of accidents

1. Project introduction

This project focuses on predicting the number of accidents in factories with regression models, from the labor union’s point of view.

Since it is important for the labor union to set standards for choosing safe workplaces for its employees, the goal of this project is to figure out which factors play an important role in the occurrence of accidents and to suggest the best prediction model based on several analyses. Thus, the number of accidents will be the response variable in the prediction model.

Tools: R
Data: The information about accidents in factories in the country
Analyses performed: Generalized Linear Regression (Elastic-net), Discriminant analysis (LDA), Random Forest Regression

2. Data description & exploration

Data description

In this dataset, there are 10,000 observations with 13 variables related to the number of accidents that occurred.

Dataset columns are below:

Year: calendar year of experience
Region: US region code for the factory location
Commodity: type of food being processed in the factory
- (Dairy / Fruit & vegetables / Meat / Seafood / Starchy food)
Distance: average distance on the sea the workers need to go for fishing
- (only for seafood factory)
Function: type of main functionality of the factory
- (Cooking / Cutting / Delivering / Packing)
Status: current operation status of the factory
- (Active / Closed by FDA / Full-time operating / Occasionally producing / Under investigation / Shutdown / Temporarily closed)
Num_employee: number of employees
Employee_hour: total number of employee hours
Hours_office: employee hours spent in the office
Hours_operation: employee hours spent in operation
Hours_other: employee hours spent in other activities
Safety_cost: money spent on safety equipment and education (in 10K)
Num_accidents: total number of accidents reported

Data exploration

First, I’d like to take a brief look at the relationship between accidents and some variables which possibly affect the occurrence of accidents.

From the graph above, the employee’s working hours and the number of employees have a positive relationship with the number of accidents.
However, Year shows no meaningful relationship with the number of accidents, which indicates that the year of experience may not affect the number of accidents much.
I think the labor union might be interested in whether the cost invested in safety equipment and education affects the number of accidents in their workplace. Thus, I checked whether there is a meaningful relationship between this cost and the number of accidents.

From the first graph, it is hard to see a meaningful relationship between the number of accidents and Safety_cost (the cost of safety equipment and education). Most of the observations are located in the range where the number of accidents is less than 20, so I narrowed the plot down to that range and checked the graph again. However, it still seems that the number of accidents is not much affected by the safety cost.

3. Pre-processing

- Data cleaning process

* Year

The Year variable will not give any meaningful information for predicting the number of accidents in the future, so this variable is removed.

* Hours_other

Some variables could expose the labor union to the appearance of an invasion of privacy if it requested them. Thus, I removed “Hours_other” (employee hours spent in other activities) for this reason.

* Distance

## [1] 942

Additionally, I checked the number of seafood factories. Since we have those observations, we need to consider the Distance variable, even though it could make the prediction model more complex. These factories make up about 9.4% of the whole dataset \((942 / 10,000)\), which is not small, so I’ve decided to keep this variable.
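
The count of seafood factories can be obtained with a simple comparison. This is only a minimal sketch on a toy data frame: the object name `accidents` and the way the commodity label is stored ("Seafood" as text rather than a numeric code) are assumptions, since the original code is not shown.

```r
# Toy stand-in for the real data (hypothetical values).
accidents <- data.frame(
  Commodity = c("Seafood", "Meat", "Dairy", "Seafood", "Meat")
)

# Number and share of seafood factories.
n_seafood     <- sum(accidents$Commodity == "Seafood")
share_seafood <- n_seafood / nrow(accidents)

n_seafood      # 2 in this toy example
share_seafood  # 0.4 in this toy example
```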

* Status

## 
##    1    2    3    4    5    6    7 
##  599  174 4214 4197   20  691  101

I also looked at the Status variable. Many factories were not operating: closed by FDA (2), under investigation (5), shutdown (6), or temporarily closed (7), which might make them useless for the current analysis. However, their status may have changed after the data was collected, so I’ve decided to keep them as well.

* Correlation

I’ve also checked the correlations among all variables.

From this plot, we can expect that 4 variables have a significant relationship with the number of accidents, as below:
- Num_employee (the number of employees)
- Employee_hour (working hours of employees)
- Hours_office (employee hours spent in the office)
- Hours_operation (employee hours spent in operation)

Also, some groups of variables are highly correlated with each other:
- Function & Commodity
- Num_employee & Employee_hour & Hours_office & Hours_operation

These results seem reasonable, and those correlated variables could harm our prediction model. However, I’ll keep them for now because we cannot yet be certain which variables actually affect the accidents.
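
A correlation check of this kind can be sketched in base R as follows. The data are simulated and only reuse a few column names from the dataset; the 0.8 cut-off for "high correlation" is an assumption chosen for illustration, not a value taken from the original analysis.

```r
# Simulated stand-in data (hypothetical values, not the real dataset).
set.seed(1)
n <- 200
Num_employee  <- rpois(n, lambda = 50)
Employee_hour <- Num_employee * 1800 + rnorm(n, sd = 5000)
Safety_cost   <- runif(n, 0, 10)
Num_accidents <- rpois(n, lambda = Num_employee / 25)
toy <- data.frame(Num_employee, Employee_hour, Safety_cost, Num_accidents)

# Pairwise Pearson correlations, ignoring missing values pair by pair.
cor_mat <- cor(toy, use = "pairwise.complete.obs")
round(cor_mat, 2)

# List variable pairs whose absolute correlation exceeds the assumed cut-off.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
high_pairs
```

Pairs listed in `high_pairs` are candidates for multicollinearity; a regularized model such as the elastic net is one way to keep them in the model while shrinking redundant coefficients.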

- Missing data

After the data cleaning, there were \(660\) missing values in “Safety_cost” and \(4\) in “Status”. The \(4\) observations are a very small portion of the dataset, so we can remove them.

##                 [,1]
## Region             0
## Commodity          0
## Distance           0
## Function           0
## Status             4
## Num_employee       0
## Employee_hour      0
## Hours_office       0
## Hours_operation    0
## Safety_cost      660
## Num_accidents      0
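
Per-column missing counts like the ones above can be computed in base R. The sketch below uses a small made-up data frame, so the numbers differ from the real output; `complete.cases()` is shown as one possible way to drop incomplete rows, not necessarily the call used in the original analysis.

```r
# Toy data with a few missing values (hypothetical values).
toy <- data.frame(
  Status        = c(1, 3, NA, 4, 3, 4),
  Safety_cost   = c(2.1, NA, 3.4, NA, 1.0, 5.2),
  Num_accidents = c(0, 1, 0, 0, 2, 0)
)

# Number and proportion of missing values per column.
na_counts <- colSums(is.na(toy))
na_share  <- colMeans(is.na(toy))
na_counts
round(na_share, 3)

# Keep only the rows without any missing value.
complete <- toy[complete.cases(toy), ]
nrow(complete)  # 3 in this toy example
```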

I checked the missing-data patterns of the original dataset, and it seems the values are not missing at random.

##      Year Region Commodity Distance Function Num_employee Employee_hour
## 9336    1      1         1        1        1            1             1
## 660     1      1         1        1        1            1             1
## 4       1      1         1        1        1            1             1
##         0      0         0        0        0            0             0
##      Hours_office Hours_operation Hours_other Num_accidents Status Safety_cost
## 9336            1               1           1             1      1           1
## 660             1               1           1             1      1           0
## 4               1               1           1             1      0           1
##                 0               0           0             0      4         660
##         
## 9336   0
## 660    1
## 4      1
##      664

Almost all of the missing values are in the Safety_cost variable. Thus, I guessed that many factories left this field blank intentionally. For example, some factories may not want to disclose a low investment in safety equipment or employee education caused by a limited budget. This would be MNAR (Missing Not At Random) missingness, due to a privacy issue.
For “Safety_cost”, the proportion of missing values is less than 7%, so we can expect that removing these observations will not affect the prediction model much.

## [1] 0.066

We can also check this decision with histograms. The first histogram is the original distribution of Safety_cost, and the second one is the distribution without the \(660\) observations.

There is not a big difference after removing the \(660\) observations, so in terms of both proportion and distribution, the removal should not have a large impact on the modeling. Hence, I have removed those observations from the dataset.

4. Modeling

I’ll fit 2 models, a generalized linear regression and a random forest regression.
To compare the accuracy of each model, I’ll split the data into two parts: a test set that contains 20% (1,868) of the observations and a training set with the remaining 80% (7,468). Each set was sampled randomly, without replacement.
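
A split with these sizes can be sketched as follows. The seed, the placeholder data frame `dat`, the test-set name `tst`, and the use of `ceiling()` to arrive at 1,868 test rows are assumptions for illustration; only the training-set name `trn` appears in the output of the original analysis.

```r
set.seed(42)                          # hypothetical seed, for reproducibility
n   <- 9336                           # rows left after removing missing values
dat <- data.frame(id = seq_len(n))    # placeholder for the cleaned dataset

# Draw 20% of the row indices; sample() samples without replacement by default.
n_test   <- ceiling(0.2 * n)
test_idx <- sample(seq_len(n), size = n_test)

tst <- dat[test_idx, , drop = FALSE]  # test set
trn <- dat[-test_idx, , drop = FALSE] # training set

c(train = nrow(trn), test = nrow(tst))  # 7468 and 1868
```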

- Generalized Linear Regression with elastic-net regularization (GLM)

I would like to recommend the GLM with elastic-net regularization, tuned by 10-fold cross-validation, for predicting the number of accidents.
Even though it takes a long time to find the best prediction model, cross-validation helps reduce overfitting. To choose the best tuning parameters \(\lambda\) and \(\alpha\), I ran the cross-validation for 6 values of \(\alpha\) (0, 0.2, 0.4, 0.6, 0.8, and 1).
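
For reference, the roles of \(\lambda\) and \(\alpha\) can be read from the elastic-net objective. The form below is the one documented for the glmnet package with a Gaussian response; that the original analysis used this family is an assumption, since the report does not state it.

```latex
\[
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2
+\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
+\alpha\,\lVert\beta\rVert_1\Bigr]
\]
```

Here \(\lambda \geq 0\) controls the overall strength of the penalty, while \(\alpha\) mixes the two penalties: \(\alpha = 0\) gives ridge regression and \(\alpha = 1\) gives the lasso, which can set coefficients exactly to zero.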

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                            s1
## (Intercept)      2.734689e-01
## Region           .           
## Commodity        8.220990e-02
## Distance         3.765571e-02
## Function        -6.218749e-02
## Status          -4.033818e-02
## Num_employee     .           
## Employee_hour    1.595785e-05
## Hours_office    -1.386317e-05
## Hours_operation -9.679107e-06
## Safety_cost     -3.751076e-02

From the results, the best \(\alpha\) is \(0.5\) and the best \(\lambda\) is \(0.0045\), which minimize the cross-validation error. The estimated coefficients for these values are shown above. As we can see, every variable except Region and Num_employee is kept in the prediction model.
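
A search over \(\alpha\) with 10-fold cross-validation might look like the sketch below. It assumes the glmnet package and uses simulated predictors, so the selected values will not match the ones reported here. Reusing the same `foldid` for every \(\alpha\) is a design choice that makes the cross-validation errors comparable across fits.

```r
library(glmnet)

# Simulated stand-in data (hypothetical; not the accident dataset).
set.seed(1)
n <- 300
p <- 6
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1 + 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)

alphas <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
foldid <- sample(rep(1:10, length.out = n))  # fixed 10-fold assignment

# One cross-validated fit per alpha; each fit searches its own lambda path.
cv_fits <- lapply(alphas, function(a) {
  cv.glmnet(x, y, alpha = a, foldid = foldid)
})
cv_err <- sapply(cv_fits, function(f) min(f$cvm))

# Pick the alpha with the smallest cross-validation error, then its lambda.
best        <- which.min(cv_err)
best_alpha  <- alphas[best]
best_lambda <- cv_fits[[best]]$lambda.min

coef(cv_fits[[best]], s = "lambda.min")  # sparse coefficient matrix
```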

- Random forest regression

I would like to suggest a random forest prediction model too. Since random forest is often one of the most accurate and efficient algorithms for predicting a response variable, I expect that the random forest model will show the best prediction.
First, I tried to find the best node size with 10 simulations based on ntree = 100 and mtry = 4.

The prediction mean squared error (MSE) was lowest when the node size was \(10\). Thus, I’ll use this node size for the prediction model.
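
The node-size search can be sketched with the randomForest package as below. The grid 1 to 10, the held-out set used for the MSE, and the simulated data are assumptions; the report only states that 10 simulations were run with ntree = 100 and mtry = 4.

```r
library(randomForest)

# Simulated stand-in data with five predictors (hypothetical values).
set.seed(1)
n <- 400
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n),
                  x4 = runif(n), x5 = runif(n))
toy$Num_accidents <- rpois(n, lambda = exp(0.5 + 1.5 * toy$x1))

idx <- sample(n, 80)
tst <- toy[idx, ]
trn <- toy[-idx, ]

# Fit one forest per candidate node size and record the held-out MSE.
node_sizes <- 1:10  # assumed grid
mse <- sapply(node_sizes, function(ns) {
  fit <- randomForest(Num_accidents ~ ., data = trn,
                      ntree = 100, mtry = 4, nodesize = ns)
  mean((predict(fit, newdata = tst) - tst$Num_accidents)^2)
})

best_nodesize <- node_sizes[which.min(mse)]
best_nodesize
```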

## 
## Call:
##  randomForest(formula = Num_accidents ~ ., data = trn, ntree = 100,      mtry = 4, nodesize = 10, importance = T) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 1.614116
##                     % Var explained: 56.91

We can assess the prediction model’s accuracy by the mean of squared residuals and the % variance explained, which indicate how well the model fits the data. A residual is the difference between the predicted and the actual value.
In this model, the mean of squared residuals is \(1.6141\), which corresponds to a typical prediction error of about \(1.27\) accidents (\(\sqrt{1.6141}\)). Since this does not seem to be a big difference, I’d like to say this model is pretty reliable.

Accuracy

To choose the final prediction model, I checked the R_squared, RMSE, and MAE of each model.

Model	R_squared	RMSE	MAE
Generalized Linear Regression	0.5875031	1.669611	0.5951963
Random Forest	0.8868308	1.517550	0.5233459

From the results, we can conclude that the Random Forest model performed better in terms of R_squared, RMSE, and MAE.
Although the differences in RMSE and MAE between the two models are not large, the R_squared of the Random Forest model is considerably larger than that of the GLM. It means the Random Forest model explains the variance of the number of accidents better.
Hence, I chose the Random Forest model as our final prediction model.
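
The three metrics can be written as small base R functions, as sketched below. The R_squared definition used here (one minus the ratio of residual to total sum of squares) is an assumption; some tools report the squared correlation between predictions and actual values instead, and the report does not say which was used. The example vectors are made up.

```r
# Metric definitions (R_squared as 1 - SSE / SST is an assumed definition).
r_squared <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))

# Made-up example values.
actual <- c(0, 1, 0, 3, 2)
pred   <- c(0.2, 0.8, 0.1, 2.5, 2.4)

c(R_squared = r_squared(actual, pred),
  RMSE      = rmse(actual, pred),
  MAE       = mae(actual, pred))
```

Applying the same functions to the test-set predictions of both models gives a like-for-like comparison.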

Check points

In addition, I checked the predicted values. There are \(1683\) predicted values lower than \(0\). This looks odd, because we know that the number of accidents cannot be negative.

## [1] 1683

However, those values are very close to \(0\). The minimum predicted value is shown below:

## [1] -5.77316e-15

Hence, we can treat these negative values as \(0\). In fact, the union is interested in unsafe factories, where the number of accidents is higher than 0, so we don’t need to be concerned about the negative predicted values.
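
Treating the tiny negative predictions as zero can be done with `pmax()`. This is a minimal sketch on a made-up prediction vector; values of order 1e-15 are typical of floating-point rounding rather than genuinely negative predictions.

```r
# Made-up predictions, including rounding-level negatives.
pred <- c(-5.77316e-15, 0, 1.2, -1e-16, 3.4)

sum(pred < 0)   # 2 negative values in this toy vector
min(pred)       # -5.77316e-15

# Clamp at zero: element-wise maximum of each prediction and 0.
pred_clamped <- pmax(pred, 0)
min(pred_clamped)  # 0
```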

5. Classifying `safe` & `unsafe` factories

The labor union is possibly more interested in correctly predicting the unsafe factories. Thus, I’ll create a new binary variable showing whether any accident occurred at the factory (accidents \(\geq 1\)) and perform a classification. The response variable will therefore have 2 levels (0: no accident occurred / 1: accident occurred).

## 
##    0    1 
## 8148 1188

There are several ways to build a classifier, and I’ll choose the Discriminant Analysis (generative model) method. The accuracy of this analysis is as below:

## [1] 0.8747323

From the results, the accuracy of this prediction model is \(87.47\)%, which is pretty reliable. Thus, this classification model could be used by the labor union to classify safe / unsafe factories.
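
A linear discriminant analysis of this kind can be sketched with `MASS::lda`, as below. The data are simulated, and the predictor names `log_hours` and `Safety_cost` as well as the response name `unsafe` are hypothetical. The sketch also computes the accuracy of always predicting the majority class, because in the class table above 8,148 of 9,336 factories (about 87.3%) have no accident, so a comparison against that baseline may be informative.

```r
library(MASS)

# Simulated stand-in data (hypothetical; not the accident dataset).
set.seed(1)
n <- 500
toy <- data.frame(log_hours   = rnorm(n, mean = 10, sd = 1),
                  Safety_cost = runif(n, 0, 10))
p_unsafe   <- plogis(-2 + 1.0 * (toy$log_hours - 10))
toy$unsafe <- factor(rbinom(n, 1, p_unsafe), levels = c(0, 1))

idx <- sample(n, 100)
tst <- toy[idx, ]
trn <- toy[-idx, ]

# Fit LDA on the training set and classify the test set.
fit  <- lda(unsafe ~ log_hours + Safety_cost, data = trn)
pred <- predict(fit, newdata = tst)$class

accuracy <- mean(pred == tst$unsafe)
baseline <- max(table(tst$unsafe)) / nrow(tst)  # always predict majority class
conf_mat <- table(predicted = pred, actual = tst$unsafe)

c(accuracy = accuracy, baseline = baseline)
conf_mat
```

The confusion matrix shows how many unsafe factories (class 1) are actually caught, which accuracy alone does not reveal when the classes are imbalanced.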

6. Important factors

We also want to identify which factors are important for predicting the number of accidents, based on the Random Forest regression model.

From the results, the number of employees and the employees’ working hours seem to be the most important factors. Thus, these 2 variables are the key factors, which means the number of accidents could be predicted largely by them.
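
Variable importance can be extracted from a fitted forest with the randomForest package, as sketched below on simulated data. The forest has to be fitted with `importance = TRUE` for the permutation-based measure (`type = 1`, reported as %IncMSE) to be available; the data-generating assumptions here are purely illustrative.

```r
library(randomForest)

# Simulated stand-in data (hypothetical values).
set.seed(1)
n <- 300
toy <- data.frame(Num_employee = rpois(n, 50),
                  Safety_cost  = runif(n, 0, 10),
                  Region       = sample(1:5, n, replace = TRUE))
toy$Employee_hour <- toy$Num_employee * 1800 + rnorm(n, sd = 5000)
toy$Num_accidents <- rpois(n, lambda = toy$Employee_hour / 45000)

fit <- randomForest(Num_accidents ~ ., data = toy,
                    ntree = 100, importance = TRUE)

# Permutation importance: increase in MSE when a variable is shuffled.
imp <- importance(fit, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]

# varImpPlot(fit)  # optional dot chart of the same measures
```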

7. Conclusion

Based on the analysis results, I would like to recommend the Random Forest regression model to the labor union for predicting the number of accidents in the future.
This analysis could give employees and labor unions insights for decision making. Furthermore, we would be able to recommend this model to people whose work is related to factory management, such as Project Managers (PM).