LOF Exercise#
Notebook by:
Royi Avital RoyiAvital@fixelalgorithms.com
Revision History#
| Version | Date | User | Content / Changes |
|---|---|---|---|
| 1.0.000 | 13/04/2024 | Royi Avital | First version |
# Import Packages
# General Tools
import numpy as np
import scipy as sp
import pandas as pd
# Machine Learning
from sklearn.neighbors import LocalOutlierFactor
# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit
# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union
# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider
Notations#
(?) Question to answer interactively.
(!) Simple task to add code for the notebook.
(@) Optional / Extra self practice.
(#) Note / Useful resource / Food for thought.
Code Notations:
someVar = 2; #<! Notation for a variable
vVector = np.random.rand(4) #<! Notation for 1D array
mMatrix = np.random.rand(4, 3) #<! Notation for 2D array
tTensor = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple = (1, 2, 3) #<! Notation for a tuple
lList = [1, 2, 3] #<! Notation for a list
dDict = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj = MyClass() #<! Notation for an object
dfData = pd.DataFrame() #<! Notation for a data frame
dsData = pd.Series() #<! Notation for a series
hObj = plt.Axes() #<! Notation for an object / handler / function handler
Code Exercise#
Single line fill
valToFill = ???
Multi Line to Fill (At least one)
# You need to start writing
????
Section to Fill
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???
???
#===============================================================#
# Configuration
# %matplotlib inline
seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)
# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme
runInGoogleColab = 'google.colab' in str(get_ipython())
# Constants
FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')
EDGE_COLOR = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF = 2
DATA_FILE_URL = r'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/NewYorkTaxiDrives.csv'
# Courses Packages
# General Auxiliary Functions
Anomaly Detection by Local Outlier Factor (LOF)#
In this exercise we’ll use the LOF algorithm to identify outliers in time series data.
The data is the number of taxi drives in New York City from 01/07/2014 to 01/02/2015 (about 7 months).
In this notebook:
We’ll build time series features.
Fit the LOF model to the data.
Visualize outliers.
(#) For visualization, the Plotly library will be used.
# Parameters
# Feature Generation
lWinLength = [12, 24, 48, 24, 48, 48]
lWinOperators = ['Mean', 'Mean', 'Mean', 'Standard Deviation', 'Standard Deviation', 'Median']
# Model
#===========================Fill This===========================#
# 1. Set the parameters of the LOF Model.
# !! Tweak this after looking at the data.
numNeighbors = 20
contaminationRatio = 0.05
#===============================================================#
# Anomaly
#===========================Fill This===========================#
# 1. Set the threshold for the LOF score.
# !! Tweak this after looking at the data.
# !! Use the guidelines as studied.
lofScoreThr = 1.6
#===============================================================#
Generate / Load Data#
The data set is composed of a timestamp (resolution of 30 minutes) and the number of drives.
# Load Data
dfData = pd.read_csv(DATA_FILE_URL)
print(f'The features data shape: {dfData.shape}')
The features data shape: (10320, 2)
# Display the Data Frame
dfData.head(10)
| | Time Stamp | Drives |
|---|---|---|
| 0 | 2014-07-01 00:00:00 | 10844 |
| 1 | 2014-07-01 00:30:00 | 8127 |
| 2 | 2014-07-01 01:00:00 | 6210 |
| 3 | 2014-07-01 01:30:00 | 4656 |
| 4 | 2014-07-01 02:00:00 | 3820 |
| 5 | 2014-07-01 02:30:00 | 2873 |
| 6 | 2014-07-01 03:00:00 | 2369 |
| 7 | 2014-07-01 03:30:00 | 2064 |
| 8 | 2014-07-01 04:00:00 | 2221 |
| 9 | 2014-07-01 04:30:00 | 2158 |
Pre Process#
Convert the string into a Date Time format of Pandas.
# Convert the `Time Stamp` column into valid Pandas time stamp
#===========================Fill This===========================#
# 1. Use Pandas' `to_datetime()` to convert the `Time Stamp` column.
dfData['Time Stamp'] = pd.to_datetime(dfData['Time Stamp'])
#===============================================================#
Plot the Data#
# Plot the Data
# Plot the Data Using PlotLy
# This will create an interactive plot of the data (You may zoom in and out).
hF = px.line(data_frame = dfData, x = 'Time Stamp', y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
(?) Do you see some patterns in the data?
(?) Can you spot some outliers? Why?
Feature Engineering#
Time series features engineering is an art.
Yet the basic approach is to work on windows and extract statistical features: mean, standard deviation, median, etc.
The Pandas package has a simple way to generate windows using the rolling() method.
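As a minimal sketch of the `rolling()` pattern (on a toy series, not the taxi data), each window aggregate is `NaN` until the window is full:

```python
import pandas as pd

dsX = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
dsMean = dsX.rolling(3).mean()  #<! 3 sample moving average
# The first (3 - 1) values are NaN since the window is not yet full.
print(dsMean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```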
# Resample Data for Hour Resolution
dfData = dfData.set_index('Time Stamp', drop = True, inplace = False)
# Resample per hour by summing
dfData = dfData.resample('h').sum() #<! Lowercase 'h' ('H' is deprecated); `axis` is deprecated as well
# Display Sampled Data
dfData.head(10)
| Time Stamp | Drives |
|---|---|
| 2014-07-01 00:00:00 | 18971 |
| 2014-07-01 01:00:00 | 10866 |
| 2014-07-01 02:00:00 | 6693 |
| 2014-07-01 03:00:00 | 4433 |
| 2014-07-01 04:00:00 | 4379 |
| 2014-07-01 05:00:00 | 6879 |
| 2014-07-01 06:00:00 | 17565 |
| 2014-07-01 07:00:00 | 29722 |
| 2014-07-01 08:00:00 | 38266 |
| 2014-07-01 09:00:00 | 39646 |
# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
# Rolling Window Operator
def ApplyRollingWindow( dsI: pd.Series, winLength: int, winOperator: str ) -> pd.Series:
# dsI - Input data series.
# winLength - The window length to calculate the feature.
# winOperator - The operation to apply on the window.
#===========================Fill This===========================#
# 1. Apply window functions by the string in `winOperator`: 'Standard Deviation', 'Median', 'Mean'.
# 2. Look at `rolling()`, `std()`, `median()` and `mean()`.
# 3. The pattern should be chaining the operation to the rolling operation: `dsI.rolling(winLength).std()`.
if winOperator == 'Standard Deviation':
dsO = dsI.rolling(winLength).std()
elif winOperator == 'Median':
dsO = dsI.rolling(winLength).median()
else:
dsO = dsI.rolling(winLength).mean()
#===============================================================#
return dsO
(@) You may add more statistical features.
(?) Are those features applicable for this method?
# Apply the Feature Extraction / Generation
lColNames = ['Drives']
for winLen, opName in zip(lWinLength, lWinOperators):
    colName = opName + f'{winLen:03d}'
    lColNames.append(colName)
    dfData[colName] = ApplyRollingWindow(dfData['Drives'], winLen, opName)
(@) You may tweak the selection of window length and operation.
# Display Results on the Data Frame
dfData.head(20)
| Time Stamp | Drives | Mean012 | Mean024 | Mean048 | Standard Deviation024 | Standard Deviation048 | Median048 |
|---|---|---|---|---|---|---|---|
| 2014-07-01 00:00:00 | 18971 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 01:00:00 | 10866 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 02:00:00 | 6693 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 03:00:00 | 4433 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 04:00:00 | 4379 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 05:00:00 | 6879 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 06:00:00 | 17565 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 07:00:00 | 29722 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 08:00:00 | 38266 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 09:00:00 | 39646 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 10:00:00 | 36704 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 11:00:00 | 35712 | 20819.666667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 12:00:00 | 37794 | 22388.250000 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 13:00:00 | 37637 | 24619.166667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 14:00:00 | 40137 | 27406.166667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 15:00:00 | 37924 | 30197.083333 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 16:00:00 | 31241 | 32435.583333 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 17:00:00 | 36728 | 34923.000000 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 18:00:00 | 50564 | 37672.916667 | NaN | NaN | NaN | NaN | NaN |
| 2014-07-01 19:00:00 | 51731 | 39507.000000 | NaN | NaN | NaN | NaN | NaN |
(?) Why are there `NaN` values?
# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = lColNames, title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
(@) Replace the features with local features such as:
Ratio between the value and the mean value (scaled by the STD).
Ratio between the value and the median value (scaled by the median deviation).
Handle Missing Values#
Our model cannot handle missing values.
Hence we must either impute or remove them.
# Set the NaN Values to the first not NaN value in the column
#===========================Fill This===========================#
# 1. Loop over each column of the data frame.
# 2. Find the first valid index in each column (Use `first_valid_index()`).
# 3. Fill the NaN's up to the first valid value with the valid value.
for colName in lColNames:
    dsT = dfData[colName]
    firstValIdx = dsT.first_valid_index()
    dfData.loc[:firstValIdx, colName] = dfData.loc[firstValIdx, colName]
#===============================================================#
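As a sketch of an alternative (on a toy frame, with a hypothetical column name), pandas' `bfill()` achieves the same back-filling of leading `NaN` values in a single call:

```python
import numpy as np
import pandas as pd

dfT = pd.DataFrame({'Feat': [np.nan, np.nan, 3.0, 4.0]})  #<! Leading NaN's, as produced by rolling windows
dfT['Feat'] = dfT['Feat'].bfill()  #<! Propagate the first valid value backward
print(dfT['Feat'].tolist())  # [3.0, 3.0, 3.0, 4.0]
```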
# Display the Results
# Should be no NaN's.
dfData
| Time Stamp | Drives | Mean012 | Mean024 | Mean048 | Standard Deviation024 | Standard Deviation048 | Median048 |
|---|---|---|---|---|---|---|---|
| 2014-07-01 00:00:00 | 18971 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 01:00:00 | 10866 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 02:00:00 | 6693 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 03:00:00 | 4433 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| 2014-07-01 04:00:00 | 4379 | 20819.666667 | 31081.958333 | 30825.145833 | 15075.515428 | 14306.974068 | 36459.5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2015-01-31 19:00:00 | 56577 | 42386.916667 | 37830.458333 | 34813.979167 | 15600.330183 | 15027.530163 | 37951.5 |
| 2015-01-31 20:00:00 | 48276 | 44721.416667 | 37597.041667 | 34854.791667 | 15390.277972 | 15062.055831 | 37951.5 |
| 2015-01-31 21:00:00 | 48389 | 46113.333333 | 37431.291667 | 34896.312500 | 15245.028314 | 15097.253710 | 37951.5 |
| 2015-01-31 22:00:00 | 53030 | 47390.750000 | 37407.000000 | 35081.541667 | 15218.564587 | 15266.657277 | 37951.5 |
| 2015-01-31 23:00:00 | 52879 | 48202.750000 | 37404.958333 | 35379.104167 | 15216.394947 | 15474.396677 | 37951.5 |
5160 rows × 7 columns
The LOF Model#
The LOF algorithm estimates the local density of each sample from its distances to nearby neighbors; when a sample's density is substantially lower than that of its neighbors, the sample is marked as an outlier.
(#) The LOF is implemented by the `LocalOutlierFactor` class in SciKit Learn.
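As a minimal sketch of the class on toy 2D data (an illustration, not the taxi features), note the sign convention: `negative_outlier_factor_` is more negative for outliers, hence the negation below:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

mX = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0]])  #<! Last sample is an obvious outlier
oLof   = LocalOutlierFactor(n_neighbors = 2)
vLabel = oLof.fit_predict(mX)  #<! -1 for outliers, 1 for inliers
vScore = -oLof.negative_outlier_factor_  #<! Positive LOF score (larger -> more anomalous)
print(vLabel[-1], vScore.argmax())  # -1 4
```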
# Build the Model
#===========================Fill This===========================#
# 1. Construct the model.
# 2. Use `fit_predict()` on the data.
# 3. Extract the LOF Score.
# !! Mind the default LOF score sign.
oLofOutDet = LocalOutlierFactor(n_neighbors = numNeighbors, contamination = contaminationRatio)
vL = oLofOutDet.fit_predict(dfData)
vLofScore = -oLofOutDet.negative_outlier_factor_
#===============================================================#
# Plot the Data Using PlotLy
hF = px.histogram(x = vLofScore, title = 'LOF', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400)
hF.show()
(?) What threshold would you set?
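As a sketch of one common heuristic (an assumption for illustration, not the notebook's guideline), the threshold can be set at a high quantile of the score distribution, matching the expected contamination ratio:

```python
import numpy as np

vLofScr = np.array([1.0, 1.1, 0.9, 1.05, 3.2, 1.0])  #<! Hypothetical LOF scores (inliers near 1)
scoreThr = np.quantile(vLofScr, 0.95)  #<! Flag roughly the top 5% as anomalies
print(np.sum(vLofScr > scoreThr))  # 1
```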
# Set the LOF Score
dfData['LOF Score'] = vLofScore
# Set Anomaly
dfData['Anomaly'] = 0
dfData.loc[dfData['LOF Score'] > lofScoreThr, 'Anomaly'] = 1
# Plot Anomalies
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.add_scatter(x = dfData[dfData['Anomaly'] == 1].index, y = dfData.loc[dfData['Anomaly'] == 1, 'Drives'], name = 'Anomaly', mode = 'markers')
hF.show()
The precision is not so good, while the recall is good.