Exploratory data analysis of crime report

Data visualization is the presentation of data in pictorial or graphical form. This form helps top management understand the data visually, grasp difficult concepts, and identify new patterns. Large volumes of complex data are easier to understand when presented as diagrams or graphs than as tables or statements. In this study, we conduct data processing and data visualization, using the R language, for crime report data on incidents that occurred in the city of Los Angeles from 2010 to 2017. The research methodology follows five steps: variable identification, data pre-processing, univariate analysis, bivariate analysis, and multivariate analysis. This paper analyses crime variables, time of occurrence, victims, type of crime, weapons used, the distribution and trends of crime, and the relationships between these variables. The results show that these methods yield insights, understanding, and new patterns, and support visual analytics on the existing data. The variations of crime variables presented in this paper are only a few of the many that can be made. Other variations can be explored to gain further insights, understanding, and new patterns from the existing data. The methods can be applied to other types of data as well.


Introduction
Data visualization is the display of data in the form of images or graphics that help decision-makers understand data visually and discover new patterns hidden in the data. Complex and large amounts of data are easier for humans to understand as pictures or graphics than in tabular or written form.
In the modern digital era, the insights used in critical organizational decision-making are gathered from Exploratory Data Analysis (EDA). EDA is the technique of studying one or more datasets to recognize the underlying structure of the data they contain [1]. EDA can be used to identify hidden patterns and correlations among variables in the data and to help people confirm predictions from the data. Over the last few decades, academics have introduced various tools and techniques to visualize hidden correlations among data variables using simple diagrams and charts [2]-[8]. Visual data analysis aids domain-specific data interpretation, such as analysis of CRISPR/Cas9 screens [2], analysis of container shipping slot bookings [9], analysis of executive functions during childhood [10], analysis of kindergarten students log data [11], sodium and potassium coronate stability [12], fault injection campaigns [13], employee demographics and earnings [14], airport waiting times [15], analysis of medical data [16] to perform analytics tasks, and analysis of Airbnb's super host profile [17]. Crime is a risk that must be faced and managed. The results of EDA can be used as input for identifying, analyzing, and planning the handling of potential risks that exist in the city [18].
Unemployment, poverty, urbanization, and rapid population growth are the primary causes of social dilemmas. One such problem, present in every city, is crime. For example, as reported in [19], Indonesian police reported that the crime rate per 100,000 population in 2017 was 129 people. Although this was a decline from 2016, when the rate was 140 people, the decline was less than 10 percent. To lessen criminality rates, police have collected a large amount of data to analyze. The study of criminal activity and the forecasting of the number of crimes remain among the most exciting problems for researchers. Research related to crime has been widely carried out [20], [21], [22], [23].
In this study, data processing and visualization were carried out for crime report data on incidents that occurred in the city of Los Angeles from 2010 to 2017. Visualizations of crime variables, time of occurrence, victims, types of crime, weapons used, the distribution and trends of crime, and the relationships between these variables are elaborated so that decision-makers can use them for further analysis.

Methodology
The research methodology, as shown in Figure 1, follows five steps: variable identification, data pre-processing, univariate analysis, bivariate analysis, and multivariate analysis.

Figure 1. EDA steps
Variable identification: this is an essential step to clearly distinguish and understand the meaning of each variable in a dataset before analyzing the data. Datasets commonly have numerical, ordinal, or nominal variables [1]. An essential characteristic of numerical data is that many mathematical operations can be applied to it. Mathematical operations cannot be applied to nominal variables, also called categorical or factor variables. Ordinal variables, also referred to as ordered categorical variables or ordered factors, hold non-numeric values that possess an inherent order.
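The three variable types can be illustrated with a minimal R sketch. The data frame, its column names, and its values below are hypothetical and are not taken from the crime dataset; the sketch only shows how numerical, nominal, and ordinal variables behave differently in R.

```r
# Hypothetical toy records; column names and values are illustrative only.
crimes <- data.frame(
  victim_age = c(25, 41, 33, 19),
  victim_sex = c("F", "M", "F", "M"),
  age_group  = c("18-29", "30-44", "30-44", "18-29"),
  stringsAsFactors = FALSE
)

# Nominal variable: categories with no inherent order.
crimes$victim_sex <- factor(crimes$victim_sex)

# Ordinal variable: categories with an inherent order.
crimes$age_group <- factor(crimes$age_group,
                           levels = c("18-29", "30-44", "45+"),
                           ordered = TRUE)

# Numerical variables support arithmetic such as the mean.
mean_age <- mean(crimes$victim_age)

# Ordered factors support comparison, but not arithmetic.
younger <- crimes$age_group < "30-44"

# Nominal factors support counting per category.
sex_counts <- table(crimes$victim_sex)
```

Declaring the type explicitly, rather than leaving every column as text, lets R reject operations that make no sense for that kind of variable.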
Data pre-processing: this is the second step of the EDA process. It covers data integration (such as finding redundant attributes, duplicate tuples, and inconsistencies), data cleaning, imputation of missing values [24], handling of noisy data, and data reduction [25].
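Some of these pre-processing tasks can be sketched in base R as follows. The records and column names are hypothetical, and median imputation is only one simple choice assumed here for illustration; the paper does not state which imputation method it follows.

```r
# Hypothetical toy records; names and values are illustrative only.
reports <- data.frame(
  report_id  = c(1, 2, 2, 3, 4),
  victim_age = c(25, 30, 30, NA, 33),
  area_id    = c(3, 3, 3, 12, 12),
  area_code  = c("03", "03", "03", "12", "12"),  # same information as area_id
  stringsAsFactors = FALSE
)

# Tuple duplication: drop rows that exactly repeat an earlier row.
reports <- reports[!duplicated(reports), ]

# Redundant attribute: drop area_code if it always matches area_id.
if (all(as.integer(reports$area_code) == reports$area_id)) {
  reports$area_code <- NULL
}

# Missing values: count them, then impute with the median
# (an assumed strategy, chosen only to keep the sketch short).
n_missing  <- sum(is.na(reports$victim_age))
age_median <- median(reports$victim_age, na.rm = TRUE)
reports$victim_age[is.na(reports$victim_age)] <- age_median
```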
Univariate data analysis: the objective of univariate analysis is to get a better understanding of each attribute. In this step, we analyze each attribute on its own to understand what it looks like. We use the ggplot2 package to visualize the data.
Bivariate data analysis: the objective of bivariate analysis is to analyze the relationship between two attributes. In this step, we compare two attributes to analyze the correlation between them. We use the ggplot2 package to visualize the data.
Multivariate data analysis: the objective of multivariate analysis is to investigate more than two attributes in greater depth. In this step, we compare three or more attributes to analyze the correlation between them. We use the ggplot2 package to visualize the data.
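The difference between the three levels of analysis can be sketched as counts over one, two, and three attributes. The toy data and column names are hypothetical, and the chart types are assumptions chosen for illustration rather than the exact plots used in this study. The plotting part runs only if ggplot2 is installed.

```r
# Hypothetical toy records; names and values are illustrative only.
crimes <- data.frame(
  year       = c(2010, 2010, 2011, 2011, 2011, 2012),
  victim_sex = c("F", "M", "F", "F", "M", "M"),
  crime_type = c("Burglary", "Assault", "Assault",
                 "Burglary", "Burglary", "Assault"),
  stringsAsFactors = FALSE
)

# Univariate: incidents per year.
by_year <- as.data.frame(table(year = crimes$year), responseName = "n")

# Bivariate: incidents per year and victim sex.
by_year_sex <- as.data.frame(
  table(year = crimes$year, victim_sex = crimes$victim_sex),
  responseName = "n"
)

# Multivariate: incidents per year, victim sex, and crime type.
by_all <- as.data.frame(
  table(year = crimes$year, victim_sex = crimes$victim_sex,
        crime_type = crimes$crime_type),
  responseName = "n"
)

# Optional plots, built only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_uni <- ggplot2::ggplot(by_year, ggplot2::aes(x = year, y = n)) +
    ggplot2::geom_col()
  p_bi <- ggplot2::ggplot(by_year_sex,
                          ggplot2::aes(x = year, y = n, fill = victim_sex)) +
    ggplot2::geom_col(position = "dodge")
  p_multi <- ggplot2::ggplot(by_all,
                             ggplot2::aes(x = year, y = n, fill = victim_sex)) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::facet_wrap(~ crime_type)
}
```

In this sketch each added attribute maps to one more visual channel: the x axis, then fill colour, then facets.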

Results and Discussions
The raw data are collected from the Los Angeles Police Department. The dataset reflects incidents of crime in the City of Los Angeles from 2010 onward. It is transcribed from original crime reports that are typed on paper. The original data includes over 1.9 million data points for the period from 1st January 2010 to 25th November 2019.
The crime report attributes include: the division of records number, made up of a two-digit year, area ID, and five digits; date reported; date occurred; time occurred; area, which refers to one of the geographic areas within the department; area name, a name designation that references a landmark of the surrounding community that the area is responsible for; reporting district number, a four-digit code that represents a sub-area within a geographic area; crime code, which indicates the crime committed; modus operandi; victim age and sex; victim descent; premise code, which represents the type of structure, vehicle, or location where the crime took place; the weapon used; the status of the case; criminal code; location, which represents the street address of the crime incident rounded to the nearest hundred block to maintain anonymity; cross street; latitude; and longitude.
The data pre-processing step consists of removing missing data, changing the data type of some attributes, renaming attributes, and finding redundant attributes, duplicate tuples, and inconsistencies. In this study, because there are many NULL values in the data for 2018 to 2019, the range of data to be explored is from 1st January 2010 to 31st December 2017. Of the 1,900,312 crime reports, only 1,895,619 are used.
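The renaming, type conversion, and date-range restriction described above can be sketched in base R as follows. The raw column names, the date format, and the values are assumptions made for illustration; they are not the actual headers or contents of the Los Angeles dataset.

```r
# Hypothetical raw extract; column names, date format, and values are assumed.
raw <- data.frame(
  `Date Occurred` = c("01/01/2010", "12/31/2017", "03/15/2018",
                      "11/25/2019", NA),
  `Victim Age`    = c("25", "41", "33", "19", "50"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rename attributes to names that are easier to use in code.
names(raw) <- c("date_occurred", "victim_age")

# Change data types: text to Date and text to integer.
raw$date_occurred <- as.Date(raw$date_occurred, format = "%m/%d/%Y")
raw$victim_age    <- as.integer(raw$victim_age)

# Remove records with a missing occurrence date.
clean <- raw[!is.na(raw$date_occurred), ]

# Keep only incidents from 1st January 2010 to 31st December 2017.
in_range <- clean$date_occurred >= as.Date("2010-01-01") &
  clean$date_occurred <= as.Date("2017-12-31")
clean <- clean[in_range, ]
```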
Using the R programming language and charts, we can analyze the crime data according to its variables, time of occurrence, victims, type of crime, weapons used, distribution of incidents, and trends of crime. Figures 2, 3, 4, and 5 show the distribution of crime incidents per year, per month, per day, and per date, respectively, from 2010 through 2017. Figures 2, 3, 4, 5, 6, and 7 and Table 1 are examples of the results of univariate analysis. Figures 9, 12, and 14 are examples of the results of bivariate analysis. Figures 10, 11, and 13 are examples of the results of multivariate analysis.
Figure 6 shows that crime incidents from the 2nd through the 30th day of every month fluctuate between 56,000 and 66,000 incidents. Surprisingly, most crimes occur on the 1st of the month (96,879 incidents), and the fewest occur on the 31st (36,851 incidents). The low count on the 31st is expected in part because only seven months of the year have a 31st day.
In Figure 10, we find that they suffer the most from intimate partner-simple assault. As for males in that age range, we find that the crime that happens to them most often is burglary from a vehicle.
Figure 13 shows the types of crime that happen on those premises. Table 1 shows the count of each type of crime that happens on those premises.
Figure 13. Types of crime that happen on the premises
Figure 14 shows the map of the top ten locations of crimes from 2010 until 2017. The Southwest area with reporting district number 0363 is the most unsafe area, with 9,609 incidents.
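The per-date counts discussed for Figure 6 can be reproduced in outline with a few lines of base R. The sketch below uses synthetic data, assumed to contain exactly one incident on every day of 2010, so it says nothing about the real counts. It only shows how the day of the month is extracted and counted, and that calendar length alone makes the 29th, 30th, and 31st appear less often.

```r
# Synthetic data: one incident on each day of 2010 (illustration only).
dates <- seq(as.Date("2010-01-01"), as.Date("2010-12-31"), by = "day")

# Extract the day of the month and count incidents per day.
day_of_month <- as.integer(format(dates, "%d"))
counts <- table(day_of_month)

# Number of months in 2010 that contain each day of the month.
days_in_month   <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
months_with_day <- sapply(1:31, function(d) sum(days_in_month >= d))

# Incidents per occurrence of each day, which removes the calendar effect.
rate <- as.integer(counts) / months_with_day
```

With uniform synthetic data every entry of `rate` is the same, so a day that still stands out after this adjustment in real data, such as the 1st, is not explained by calendar length.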

Conclusion
This paper presents the results of Exploratory Data Analysis (EDA) using univariate analysis, bivariate analysis, and multivariate analysis. The R programming language was applied to 1,895,619 rows and 28 columns of Los Angeles Crime Report Data from 2010 until 2017. The results show that these methods yield insights, understanding, and new patterns from the existing data. By performing EDA, we can analyze the data using tables and various types of charts, such as line charts, bar charts, stacked charts, and geo charts.
The variations of crime variables presented in this paper are only a few of the many that can be made. Other variations can be explored to gain further insights, understanding, and new patterns from the existing data. The methods can be applied to other types of data as well.