Fixel Algorithms

LOF Exercise#

Notebook by:

Revision History#

| Version | Date | User | Content / Changes |
|---------|------------|-------------|-------------------|
| 1.0.000 | 13/04/2024 | Royi Avital | First version |

Open In Colab

# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.neighbors import LocalOutlierFactor

# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

Notations#

  • (?) Question to answer interactively.

  • (!) Simple task to add code to the notebook.

  • (@) Optional / Extra self practice.

  • (#) Note / Useful resource / Food for thought.

Code Notations:

someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler

Code Exercise#

  • Single line fill

valToFill = ???
  • Multi Line to Fill (At least one)

# You need to start writing
????
  • Section to Fill

#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

DATA_FILE_URL = r'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/NewYorkTaxiDrives.csv'
# Courses Packages
# General Auxiliary Functions

Anomaly Detection by Local Outlier Factor (LOF)#

In this exercise we’ll use the LOF algorithm to identify outliers in time series data.
The data we’ll use is the number of taxi drives in New York City from 01/07/2014 to 01/02/2015 (about 7 months).

In this notebook:

  • We’ll build time series features.

  • Fit the LOF model to data.

  • Visualize outliers.

  • (#) For visualization, the PlotLy library will be used.

# Parameters

# Feature Generation
# !! The lists must have matching lengths (one operator per window length).
lWinLength      = [12, 24, 48, 24, 48, 48]
lWinOperators   = ['Mean', 'Mean', 'Mean', 'Standard Deviation', 'Standard Deviation', 'Median']

# Model
#===========================Fill This===========================#
# 1. Set the parameters of the LOF Model.
# !! Tweak this after looking at the data.
numNeighbors        = 20
contaminationRatio  = 0.05
#===============================================================#

# Anomaly
#===========================Fill This===========================#
# 1. Set the threshold for the LOF score.
# !! Tweak this after looking at the data.
# !! Use the guidelines as studied.
lofScoreThr = 1.6
#===============================================================#

Generate / Load Data#

The data set is composed of a timestamp (30 minute resolution) and the number of drives.

# Load Data

dfData = pd.read_csv(DATA_FILE_URL)

print(f'The features data shape: {dfData.shape}')
The features data shape: (10320, 2)
# Display the Data Frame

dfData.head(10)
Time Stamp Drives
0 2014-07-01 00:00:00 10844
1 2014-07-01 00:30:00 8127
2 2014-07-01 01:00:00 6210
3 2014-07-01 01:30:00 4656
4 2014-07-01 02:00:00 3820
5 2014-07-01 02:30:00 2873
6 2014-07-01 03:00:00 2369
7 2014-07-01 03:30:00 2064
8 2014-07-01 04:00:00 2221
9 2014-07-01 04:30:00 2158

Pre Process#

Convert the string into a Date Time format of Pandas.

# Convert the `Time Stamp` column into valid Pandas time stamp

#===========================Fill This===========================#
# 1. Use Pandas' `to_datetime()` to convert the `Time Stamp` column.
dfData['Time Stamp'] = pd.to_datetime(dfData['Time Stamp'])
#===============================================================#

Plot the Data#

# Plot the Data

# Plot the Data Using PlotLy
# This will create an interactive plot of the data (You may zoom in and out).
hF = px.line(data_frame = dfData, x = 'Time Stamp', y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
  • (?) Do you see any patterns in the data?

  • (?) Can you spot some outliers? Why?

Feature Engineering#

Time series feature engineering is an art.
Yet the basic approach is to work on windows and extract statistical features: Mean, Standard Deviation, Median, etc.

The Pandas package has a simple way to generate windows using the rolling() method.
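A minimal sketch (toy series, not the taxi data) of how `rolling()` works: it creates a window object whose reduction methods yield one value per window, with NaN until the window is full:

```python
import pandas as pd

dsToy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling window of length 3: the first 2 entries are NaN (incomplete window)
dsMean = dsToy.rolling(3).mean()
print(dsMean.tolist()) #<! [nan, nan, 2.0, 3.0, 4.0]
```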

# Resample Data for Hour Resolution
dfData = dfData.set_index('Time Stamp', drop = True, inplace = False)

# Resample per hour by summing
dfData = dfData.resample('h').sum() #<! The 'H' alias and the `axis` keyword are deprecated
# Display Sampled Data

dfData.head(10)
Drives
Time Stamp
2014-07-01 00:00:00 18971
2014-07-01 01:00:00 10866
2014-07-01 02:00:00 6693
2014-07-01 03:00:00 4433
2014-07-01 04:00:00 4379
2014-07-01 05:00:00 6879
2014-07-01 06:00:00 17565
2014-07-01 07:00:00 29722
2014-07-01 08:00:00 38266
2014-07-01 09:00:00 39646
# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
# Rolling Window Operator

def ApplyRollingWindow( dsI: pd.Series, winLength: int, winOperator: str ) -> pd.Series:
    # dsI - Input data series.
    # winLength - The window length to calculate the feature.
    # winOperator - The operation to apply on the window.

    #===========================Fill This===========================#
    # 1. Apply window functions by the string in `winOperator`: 'Standard Deviation', 'Median', 'Mean'.
    # 2. Look at `rolling()`, `std()`, `median()` and `mean()`.
    # 3. The pattern should be chaining the operation to the rolling operation: `dsI.rolling(winLength).std()`.
    if winOperator == 'Standard Deviation':
        dsO = dsI.rolling(winLength).std()
    elif winOperator == 'Median':
        dsO = dsI.rolling(winLength).median()
    elif winOperator == 'Mean':
        dsO = dsI.rolling(winLength).mean()
    else:
        raise ValueError(f'Unsupported window operator: {winOperator}')
    #===============================================================#
    
    return dsO
  • (@) You may add more statistical features.

  • (?) Are those features applicable for this method?

# Apply the Feature Extraction / Generation

lColNames = ['Drives']
for winLen, opName in zip(lWinLength, lWinOperators):
    colName = opName + f'{winLen:03d}'
    lColNames.append(colName)
    dfData[colName] = ApplyRollingWindow(dfData['Drives'], winLen, opName)
  • (@) You may tweak the selection of window length and operation.

# Display Results on the Data Frame

dfData.head(20)
Drives Mean012 Mean024 Mean048 Standard Deviation024 Standard Deviation048 Median048
Time Stamp
2014-07-01 00:00:00 18971 NaN NaN NaN NaN NaN NaN
2014-07-01 01:00:00 10866 NaN NaN NaN NaN NaN NaN
2014-07-01 02:00:00 6693 NaN NaN NaN NaN NaN NaN
2014-07-01 03:00:00 4433 NaN NaN NaN NaN NaN NaN
2014-07-01 04:00:00 4379 NaN NaN NaN NaN NaN NaN
2014-07-01 05:00:00 6879 NaN NaN NaN NaN NaN NaN
2014-07-01 06:00:00 17565 NaN NaN NaN NaN NaN NaN
2014-07-01 07:00:00 29722 NaN NaN NaN NaN NaN NaN
2014-07-01 08:00:00 38266 NaN NaN NaN NaN NaN NaN
2014-07-01 09:00:00 39646 NaN NaN NaN NaN NaN NaN
2014-07-01 10:00:00 36704 NaN NaN NaN NaN NaN NaN
2014-07-01 11:00:00 35712 20819.666667 NaN NaN NaN NaN NaN
2014-07-01 12:00:00 37794 22388.250000 NaN NaN NaN NaN NaN
2014-07-01 13:00:00 37637 24619.166667 NaN NaN NaN NaN NaN
2014-07-01 14:00:00 40137 27406.166667 NaN NaN NaN NaN NaN
2014-07-01 15:00:00 37924 30197.083333 NaN NaN NaN NaN NaN
2014-07-01 16:00:00 31241 32435.583333 NaN NaN NaN NaN NaN
2014-07-01 17:00:00 36728 34923.000000 NaN NaN NaN NaN NaN
2014-07-01 18:00:00 50564 37672.916667 NaN NaN NaN NaN NaN
2014-07-01 19:00:00 51731 39507.000000 NaN NaN NaN NaN NaN
  • (?) Why are there NaN values?

# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = lColNames, title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()
  • (@) Replace the features with local features such as:

    • Ratio between the value to the mean value (Scaled by STD).

    • Ratio between the value to the median value (Scaled by Median deviation).
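Such local features can be sketched as follows (toy data, hypothetical names): a z-score like deviation of each value from its rolling mean, scaled by the rolling STD:

```python
import numpy as np
import pandas as pd

dsVal = pd.Series(np.arange(1.0, 11.0))

winLen = 4
dsMu  = dsVal.rolling(winLen).mean()
dsStd = dsVal.rolling(winLen).std()

# Local deviation of each value from its own window statistics
dsZ = (dsVal - dsMu) / dsStd
```

Unlike the raw rolling statistics, this feature is scale free, which can make the outlier scores less sensitive to the absolute traffic level.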

Handle Missing Values#

Our model cannot handle missing values.
Hence we must impute or remove them.
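As a hedged alternative to looping over the columns (assuming, as in this case, that the only missing values are the leading ones), Pandas' `bfill()` fills them with the first valid value in a single call:

```python
import numpy as np
import pandas as pd

dfToy = pd.DataFrame({'A': [np.nan, np.nan, 3.0, 4.0]})
dfToy = dfToy.bfill() #<! Propagates the next valid value backwards
print(dfToy['A'].tolist()) #<! [3.0, 3.0, 3.0, 4.0]
```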

# Set the NaN Values to the first not NaN value in the column

#===========================Fill This===========================#
# 1. Loop over each column of the data frame.
# 2. Find the first valid index in each column (Use `first_valid_index()`).
# 3. Fill the NaN's up to the first valid value with the valid value.
for colName in lColNames:
    dsT = dfData[colName]
    firstValIdx = dsT.first_valid_index()
    dfData.loc[:firstValIdx, colName] = dfData.loc[firstValIdx, colName]

#===============================================================#
# Display the Results
# Should be no NaN's.

dfData
Drives Mean012 Mean024 Mean048 Standard Deviation024 Standard Deviation048 Median048
Time Stamp
2014-07-01 00:00:00 18971 20819.666667 31081.958333 30825.145833 15075.515428 14306.974068 36459.5
2014-07-01 01:00:00 10866 20819.666667 31081.958333 30825.145833 15075.515428 14306.974068 36459.5
2014-07-01 02:00:00 6693 20819.666667 31081.958333 30825.145833 15075.515428 14306.974068 36459.5
2014-07-01 03:00:00 4433 20819.666667 31081.958333 30825.145833 15075.515428 14306.974068 36459.5
2014-07-01 04:00:00 4379 20819.666667 31081.958333 30825.145833 15075.515428 14306.974068 36459.5
... ... ... ... ... ... ... ...
2015-01-31 19:00:00 56577 42386.916667 37830.458333 34813.979167 15600.330183 15027.530163 37951.5
2015-01-31 20:00:00 48276 44721.416667 37597.041667 34854.791667 15390.277972 15062.055831 37951.5
2015-01-31 21:00:00 48389 46113.333333 37431.291667 34896.312500 15245.028314 15097.253710 37951.5
2015-01-31 22:00:00 53030 47390.750000 37407.000000 35081.541667 15218.564587 15266.657277 37951.5
2015-01-31 23:00:00 52879 48202.750000 37404.958333 35379.104167 15216.394947 15474.396677 37951.5

5160 rows × 7 columns

The LOF Model#

The LOF algorithm compares the local density of each sample, estimated from the distances to its nearest neighbors, with the local densities of those neighbors. When a sample’s density is much lower than its neighbors’, it is marked as an outlier.
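A minimal toy sketch of this behavior (synthetic data, hypothetical variable names): a dense Gaussian cluster plus one far point, which LOF flags as the outlier:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(0)
mXToy = np.concatenate([np.random.randn(50, 2), [[8.0, 8.0]]]) #<! 50 inliers + 1 far point

oLof   = LocalOutlierFactor(n_neighbors = 10)
vLbl   = oLof.fit_predict(mXToy)        #<! -1 for outliers, 1 for inliers
vScore = -oLof.negative_outlier_factor_ #<! Higher score -> more anomalous
```

Note the sign flip: Scikit Learn exposes the score as `negative_outlier_factor_`, so larger positive values of `vScore` mean more anomalous samples.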

# Build the Model

#===========================Fill This===========================#
# 1. Construct the model.
# 2. Use `fit_predict()` on the data.
# 3. Extract the LOF Score.
# !! Mind the default LOF score sign.
oLofOutDet = LocalOutlierFactor(n_neighbors = numNeighbors, contamination = contaminationRatio)
vL         = oLofOutDet.fit_predict(dfData)
vLofScore  = -oLofOutDet.negative_outlier_factor_
#===============================================================#
# Plot the Data Using PlotLy
hF = px.histogram(x = vLofScore, title = 'LOF', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400)

hF.show()
  • (?) What threshold would you set?
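One hedged way to choose the threshold (toy scores, hypothetical names) is the quantile matching an assumed contamination ratio, rather than eyeballing the histogram:

```python
import numpy as np

np.random.seed(0)
# Toy scores: 95 "normal" samples near 1, 5 clearly anomalous samples
vScoreToy = np.concatenate([np.random.uniform(0.9, 1.2, 95), np.random.uniform(2.0, 3.0, 5)])

contaminationRatio = 0.05
scoreThr = np.quantile(vScoreToy, 1 - contaminationRatio) #<! Flags the top 5% as anomalies
```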

# Set the LOF Score
dfData['LOF Score'] = vLofScore
# Set Anomaly

dfData['Anomaly'] = 0

dfData.loc[dfData['LOF Score'] > lofScoreThr,'Anomaly'] = 1
# Plot Anomalies 
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')

hF.add_scatter(x = dfData[dfData['Anomaly'] == 1].index, y = dfData.loc[dfData['Anomaly'] == 1, 'Drives'], name = 'Anomaly', mode = 'markers')

hF.show()

With this threshold the precision is not so good, while the recall is good.