LOF Exercise#
Notebook by:
Royi Avital RoyiAvital@fixelalgorithms.com
Revision History#
| Version | Date | User | Content / Changes |
|---|---|---|---|
| 1.0.000 | 13/04/2024 | Royi Avital | First version |
# Import Packages
# General Tools
import numpy as np
import scipy as sp
import pandas as pd
# Machine Learning
from sklearn.neighbors import LocalOutlierFactor
# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit
# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union
# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider
Notations#
(?) Question to answer interactively.
(!) Simple task to add code for the notebook.
(@) Optional / Extra self practice.
(#) Note / Useful resource / Food for thought.
Code Notations:
someVar = 2; #<! Notation for a variable
vVector = np.random.rand(4) #<! Notation for 1D array
mMatrix = np.random.rand(4, 3) #<! Notation for 2D array
tTensor = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple = (1, 2, 3) #<! Notation for a tuple
lList = [1, 2, 3] #<! Notation for a list
dDict = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj = MyClass() #<! Notation for an object
dfData = pd.DataFrame() #<! Notation for a data frame
dsData = pd.Series() #<! Notation for a series
hObj = plt.Axes() #<! Notation for an object / handler / function handler
Code Exercise#
Single line fill
valToFill = ???
Multi Line to Fill (At least one)
# You need to start writing
????
Section to Fill
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???
???
#===============================================================#
# Configuration
# %matplotlib inline
seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)
# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme
runInGoogleColab = 'google.colab' in str(get_ipython())
# Constants
FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')
EDGE_COLOR = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF = 2
DATA_FILE_URL = r'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/NewYorkTaxiDrives.csv'
# Courses Packages
# General Auxiliary Functions
Anomaly Detection by Local Outlier Factor (LOF)#
In this exercise we’ll use the LOF algorithm to identify outliers in time series data.
The data is the number of taxi drives in New York City from 01/07/2014 to 01/02/2015 (about 7 months).
In this notebook:
We’ll build time series features.
Fit the LOF model to the data.
Visualize outliers.
(#) For visualization, the Plotly library will be used.
# Parameters
# Feature Generation
lWinLength = [12, 24, 48, 24, 48, 48]
lWinOperators = ['Mean', 'Mean', 'Mean', 'Standard Deviation', 'Standard Deviation', 'Median']
# Model
#===========================Fill This===========================#
# 1. Set the parameters of the LOF Model.
# !! Tweak this after looking at the data.
numNeighbors = 20
contaminationRatio = 0.05
#===============================================================#
# Anomaly
#===========================Fill This===========================#
# 1. Set the threshold for the LOF score.
# !! Tweak this after looking at the data.
# !! Use the guidelines as studied.
lofScoreThr = 1.6
#===============================================================#
Generate / Load Data#
The data set is composed of a timestamp (resolution of 30 minutes) and the number of drives.
# Load Data
dfData = pd.read_csv(DATA_FILE_URL)
print(f'The features data shape: {dfData.shape}')
The features data shape: (10320, 2)
# Display the Data Frame
dfData.head(10)
| | Time Stamp | Drives |
|---|---|---|
| 0 | 2014-07-01 00:00:00 | 10844 |
| 1 | 2014-07-01 00:30:00 | 8127 |
| 2 | 2014-07-01 01:00:00 | 6210 |
| 3 | 2014-07-01 01:30:00 | 4656 |
| 4 | 2014-07-01 02:00:00 | 3820 |
| 5 | 2014-07-01 02:30:00 | 2873 |
| 6 | 2014-07-01 03:00:00 | 2369 |
| 7 | 2014-07-01 03:30:00 | 2064 |
| 8 | 2014-07-01 04:00:00 | 2221 |
| 9 | 2014-07-01 04:30:00 | 2158 |
Pre Process#
Convert the string into a Date Time format of Pandas.
# Convert the `Time Stamp` column into valid Pandas time stamp
#===========================Fill This===========================#
# 1. Use Pandas' `to_datetime()` to convert the `Time Stamp` column.
dfData['Time Stamp'] = pd.to_datetime(dfData['Time Stamp'])
#===============================================================#
Plot the Data#
# Plot the Data
# Plot the Data Using PlotLy
# This will create an interactive plot of the data (You may zoom in and out).
hF = px.line(data_frame = dfData, x = 'Time Stamp', y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
(?) Do you see some patterns in the data?
(?) Can you spot some outliers? Why?
Feature Engineering#
Time series features engineering is an art.
Yet the basic approach is to work on windows and extract statistical features: mean, standard deviation, median, etc.
The Pandas package has a simple way to generate windows using the rolling() method.
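As a minimal sketch of the `rolling()` pattern (on a toy series, not the taxi data), each window aggregate is `NaN` until the window is full:

```python
import pandas as pd

dsX = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
dsMean = dsX.rolling(3).mean()  #<! 3 sample moving average
# The first (3 - 1) values are NaN since the window is not yet full.
print(dsMean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```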
# Resample Data for Hour Resolution
dfData = dfData.set_index('Time Stamp', drop = True, inplace = False)
# Resample per hour by summing
dfData = dfData.resample('h').sum() #<! Lowercase 'h' ('H' is deprecated); `axis` is deprecated as well
# Display Sampled Data
dfData.head(10)
| Time Stamp | Drives |
|---|---|
| 2014-07-01 00:00:00 | 18971 |
| 2014-07-01 01:00:00 | 10866 |
| 2014-07-01 02:00:00 | 6693 |
| 2014-07-01 03:00:00 | 4433 |
| 2014-07-01 04:00:00 | 4379 |
| 2014-07-01 05:00:00 | 6879 |
| 2014-07-01 06:00:00 | 17565 |
| 2014-07-01 07:00:00 | 29722 |
| 2014-07-01 08:00:00 | 38266 |
| 2014-07-01 09:00:00 | 39646 |
# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
# Rolling Window Operator
def ApplyRollingWindow( dsI: pd.Series, winLength: int, winOperator: str ) -> pd.Series:
# dsI - Input data series.
# winLength - The window length to calculate the feature.
# winOperator - The operation to apply on the window.
#===========================Fill This===========================#
# 1. Apply window functions by the string in `winOperator`: 'Standard Deviation', 'Median', 'Mean'.
# 2. Look at `rolling()`, `std()`, `median()` and `mean()`.
# 3. The pattern should be chaining the operation to the rolling operation: `dsI.rolling(winLength).std()`.
if winOperator == 'Standard Deviation':
dsO = dsI.rolling(winLength).std()
elif winOperator == 'Median':
dsO = dsI.rolling(winLength).median()
else:
dsO = dsI.rolling(winLength).mean()
#===============================================================#
return dsO
(@) You may add more statistical features.
(?) Are those features applicable for this method?
# Apply the Feature Extraction / Generation
lColNames = ['Drives']
for winLen, opName in zip(lWinLength, lWinOperators):
    colName = opName + f'{winLen:03d}'
    lColNames.append(colName)
    dfData[colName] = ApplyRollingWindow(dfData['Drives'], winLen, opName)
(@) You may tweak the selection of window length and operation.
# Display Results on the Data Frame
dfData.head(20)
| Time Stamp | Drives | Mean012 | Mean024 | Mean048 | Standard Deviation024 | Standard Deviation048 | Median048 |
|---|---|---|---|---|---|---|---|
| 2014-07-01 00:00:00 | 18971 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 01:00:00 | 10866 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 02:00:00 | 6693 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 03:00:00 | 4433 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 04:00:00 | 4379 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 05:00:00 | 6879 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 06:00:00 | 17565 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 07:00:00 | 29722 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 08:00:00 | 38266 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 09:00:00 | 39646 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 10:00:00 | 36704 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 11:00:00 | 35712 | 20819.666667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 12:00:00 | 37794 | 22388.250000 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 13:00:00 | 37637 | 24619.166667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 14:00:00 | 40137 | 27406.166667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 15:00:00 | 37924 | 30197.083333 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 16:00:00 | 31241 | 32435.583333 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 17:00:00 | 36728 | 34923.000000 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 18:00:00 | 50564 | 37672.916667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 19:00:00 | 51731 | 39507.000000 | NaN | NaN | NaN | NaN | NaN |
(?) Why are there `NaN` values?
# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = lColNames, title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
(@) Replace the features with local features such as:
Ratio between the value and the mean value (scaled by the STD).
Ratio between the value and the median value (scaled by the median deviation).
Handle Missing Values#
Our model cannot handle missing values.
Hence we must either impute or remove them.
# Set the NaN Values to the first not NaN value in the column
#===========================Fill This===========================#
# 1. Loop over each column of the data frame.
# 2. Find the first valid index in each column (Use `first_valid_index()`).
# 3. Fill the NaN's up to the first valid value with the valid value.
for colName in lColNames:
    dsT = dfData[colName]
    firstValIdx = dsT.first_valid_index()
    dfData.loc[:firstValIdx, colName] = dfData.loc[firstValIdx, colName]
#===============================================================#
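As a sketch of an alternative (on a toy frame, with a hypothetical column name), pandas' `bfill()` achieves the same back-filling of leading `NaN` values in a single call:

```python
import numpy as np
import pandas as pd

dfT = pd.DataFrame({'Feat': [np.nan, np.nan, 3.0, 4.0]})  #<! Leading NaN's, as produced by rolling windows
dfT['Feat'] = dfT['Feat'].bfill()  #<! Propagate the first valid value backward
print(dfT['Feat'].tolist())  # [3.0, 3.0, 3.0, 4.0]
```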
# Display the Results
# Should be no NaN's.
dfData
| Time Stamp | Drives | Mean012 | Mean024 | Mean048 | Standard Deviation024 | Standard Deviation048 | Median048 |
|---|---|---|---|---|---|---|---|
| 2014-07-01 00:00:00 | 18971 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 01:00:00 | 10866 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 02:00:00 | 6693 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 03:00:00 | 4433 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 04:00:00 | 4379 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-01-31 19:00:00 | 56577 | 42386.916667 | 37830.458333 | 34813.979167 | 15600.330183 | 15027.530163 | 37951.5 |
| 2015-01-31 20:00:00 | 48276 | 44721.416667 | 37597.041667 | 34854.791667 | 15390.277972 | 15062.055831 | 37951.5 |
| 2015-01-31 21:00:00 | 48389 | 46113.333333 | 37431.291667 | 34896.312500 | 15245.028314 | 15097.253710 | 37951.5 |
| 2015-01-31 22:00:00 | 53030 | 47390.750000 | 37407.000000 | 35081.541667 | 15218.564587 | 15266.657277 | 37951.5 |
| 2015-01-31 23:00:00 | 52879 | 48202.750000 | 37404.958333 | 35379.104167 | 15216.394947 | 15474.396677 | 37951.5 |
5160 rows × 7 columns
The LOF Model#
The LOF algorithm estimates the local density of each sample from its distances to nearby neighbors; when a sample's density is substantially lower than that of its neighbors, the sample is marked as an outlier.
(#) The LOF is implemented by the `LocalOutlierFactor` class in SciKit Learn.
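As a minimal sketch of the class on toy 2D data (an illustration, not the taxi features), note the sign convention: `negative_outlier_factor_` is more negative for outliers, hence the negation below:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

mX = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0]])  #<! Last sample is an obvious outlier
oLof   = LocalOutlierFactor(n_neighbors = 2)
vLabel = oLof.fit_predict(mX)  #<! -1 for outliers, 1 for inliers
vScore = -oLof.negative_outlier_factor_  #<! Positive LOF score (larger -> more anomalous)
print(vLabel[-1], vScore.argmax())  # -1 4
```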
# Build the Model
#===========================Fill This===========================#
# 1. Construct the model.
# 2. Use `fit_predict()` on the data.
# 3. Extract the LOF Score.
# !! Mind the default LOF score sign.
oLofOutDet = LocalOutlierFactor(n_neighbors = numNeighbors, contamination = contaminationRatio)
vL = oLofOutDet.fit_predict(dfData)
vLofScore = -oLofOutDet.negative_outlier_factor_
#===============================================================#
# Plot the Data Using PlotLy
hF = px.histogram(x = vLofScore, title = 'LOF', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400)
hF.show()
(?) What threshold would you set?
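As a sketch of one common heuristic (an assumption for illustration, not the notebook's guideline), the threshold can be set at a high quantile of the score distribution, matching the expected contamination ratio:

```python
import numpy as np

vLofScr = np.array([1.0, 1.1, 0.9, 1.05, 3.2, 1.0])  #<! Hypothetical LOF scores (inliers near 1)
scoreThr = np.quantile(vLofScr, 0.95)  #<! Flag roughly the top 5% as anomalies
print(np.sum(vLofScr > scoreThr))  # 1
```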
# Set the LOF Score
dfData['LOF Score'] = vLofScore
# Set Anomaly
dfData['Anomaly'] = 0
dfData.loc[dfData['LOF Score'] > lofScoreThr, 'Anomaly'] = 1
# Plot Anomalies
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.add_scatter(x = dfData[dfData['Anomaly'] == 1].index, y = dfData.loc[dfData['Anomaly'] == 1, 'Drives'], name = 'Anomaly', mode = 'markers')
hF.show()
The precision is not so good, while the recall is good.