GitHub link: https://github.com/mth522SNPP/Temporal-Analysis-of-Earning-Trends-in-Boston-City-Unveiling-Patterns-Departments-and-Job-Titles
Boston Earnings Survey
December 5th, 2023
The Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX) model is an extension of the ARIMA model, specifically designed to handle time series data with both seasonal patterns and external factors (exogenous variables). SARIMAX combines the concepts of ARIMA modeling with the ability to incorporate additional variables that might influence the time series behavior.
Here are the key components of the SARIMAX model:
- Seasonal Component (S): SARIMAX models seasonality, that is, patterns in the time series that repeat at regular intervals. This is particularly useful for data with clear seasonal behavior, such as monthly or yearly cycles.
- Exogenous Variables (X): In addition to the seasonal and autoregressive components, SARIMAX allows for the inclusion of exogenous variables. These are external factors that might influence the time series but are not part of the time series itself. For example, if you are modeling sales data, you might include factors like marketing spending or promotional events as exogenous variables.
- AutoRegressive (AR) Component (p), Integrated (I) Component (d), and Moving Average (MA) Component (q): SARIMAX keeps the ARIMA model's autoregressive, integrated, and moving average components, denoted as (p, d, q), which capture the temporal dependencies and trends within the time series.
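One common way to write the model compactly is as a regression on the exogenous variables with seasonal ARIMA errors. The notation here is an assumption added for illustration and does not come from the post: $B$ is the backshift operator ($B y_t = y_{t-1}$), $s$ is the seasonal period, $(P, D, Q)$ are the seasonal counterparts of $(p, d, q)$, $x_t$ is the vector of exogenous variables with coefficients $\beta$, and $\varepsilon_t$ is white noise.

```latex
y_t = \beta^\top x_t + u_t, \qquad
\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D\,u_t
  = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t
```

Here $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials of orders $p$ and $q$, and $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts. Setting $P = D = Q = 0$ and dropping $x_t$ recovers an ordinary ARIMA(p, d, q) model.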
November 27th, 2023
Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) are statistical tools used in time series analysis to understand and identify patterns, dependencies, and relationships within a time series data set.
- Autocorrelation Function (ACF):
- Definition: ACF measures the correlation between a time series and its lagged values. It helps identify the presence of seasonality or periodic patterns in the data.
- Interpretation: A positive autocorrelation at lag k indicates a positive correlation between the values k time units apart. A negative autocorrelation suggests an inverse relationship.
- Use: ACF is useful for determining the order of the Moving Average (MA) component in an ARIMA model. Peaks or significant spikes in the ACF plot at specific lags indicate potential seasonality or repeating patterns.
- Partial Autocorrelation Function (PACF):
- Definition: PACF measures the correlation between a time series and its lagged values after removing the effect of intervening lags. It helps identify direct relationships between observations at different lags.
- Interpretation: The PACF at lag k represents the correlation between observations k time units apart, removing the effects of the lags in between. It can be interpreted as the direct influence of one observation on another at a specific lag.
- Use: PACF is valuable for determining the order of the AutoRegressive (AR) component in an ARIMA model. Significant spikes in the PACF plot at specific lags indicate potential direct relationships.
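The two definitions above can be sketched in plain Python. This is a minimal illustration, not the code used in the project: the function names `acf` and `pacf`, the simulated AR(1) series, and its coefficient 0.7 are all assumptions made for the example. The PACF is computed with the Durbin-Levinson recursion, which is one standard way to derive it from the ACF.

```python
import random


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares, used to normalize
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def pacf(series, max_lag):
    """Sample partial autocorrelation via the Durbin-Levinson recursion."""
    rho = acf(series, max_lag)
    result = [1.0]
    phi_prev = []  # AR coefficients of the order-(k-1) fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den  # partial autocorrelation at lag k
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        result.append(phi_kk)
    return result


# Simulated AR(1) series (made-up data, coefficient 0.7 chosen for illustration).
random.seed(0)
series = [0.0]
for _ in range(1999):
    series.append(0.7 * series[-1] + random.gauss(0, 1))

rho = acf(series, 5)
partial = pacf(series, 5)
print("ACF: ", [round(r, 2) for r in rho])
print("PACF:", [round(p, 2) for p in partial])
```

For an AR(1) series like this one, the ACF should decay gradually while the PACF should fall to roughly zero after lag 1, which is the kind of pattern used to choose the AR order.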
November 22nd, 2023
Time series refers to a series of data points collected or recorded in chronological order over regular intervals of time. These data points could represent various metrics such as stock prices, temperature readings, sales figures, or any other variable that changes over time. Time series analysis involves studying the patterns, trends, and characteristics within the data to make predictions or gain insights into future behavior.
ARIMA, which stands for AutoRegressive Integrated Moving Average, is a widely used statistical method for time series forecasting.
- AutoRegressive (AR) Component (p): This component captures the relationship between the current observation and its past values. The term “autoregressive” indicates that the model uses the relationship with its own past values.
- Integrated (I) Component (d): This component involves differencing the time series data to make it stationary. Stationary data has a constant mean and variance over time, making it easier to model. The order of differencing (d) represents the number of times differencing is applied to achieve stationarity.
- Moving Average (MA) Component (q): This component models the current observation as a function of past forecast errors (residuals). The order q is the number of lagged error terms included in the model.
Here’s a brief breakdown of how ARIMA works:
- Stationarity: Ensure the time series data is stationary by applying differencing if necessary.
- Model Identification: Determine the values of p, d, and q based on the characteristics of the data, often aided by autocorrelation and partial autocorrelation plots.
- Parameter Estimation: Estimate the parameters of the ARIMA model.
- Model Validation: Validate the model’s performance using historical data.
- Forecasting: Use the fitted ARIMA model to make future predictions.
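A few of these steps can be made concrete in plain Python. The sketch below is an illustration under stated assumptions, not the project's code: it hard-codes an ARIMA(1,1,0) model (one difference, one autoregressive term, no moving-average term), estimates the AR coefficient by ordinary least squares instead of the maximum-likelihood fit a library would typically use, and skips model identification and validation. The names `difference`, `fit_ar1`, and `forecast_arima_110` and the sample series are made up for the example.

```python
def difference(series):
    """First-order differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_arima_110(series, steps):
    """Difference once, fit AR(1) on the differences, forecast, undo the differencing."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # forecast the next difference
        level += last_diff               # integrate back to the original scale
        forecasts.append(level)
    return forecasts


# Made-up series whose differences halve at every step.
series = [0, 8, 12, 14, 15, 15.5]
print(forecast_arima_110(series, 2))
```

The differencing step corresponds to the "I" in ARIMA, and adding each forecast difference back onto the last observed level is what returns the forecast to the original scale.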
November 17th, 2023
Upon delving into the diverse departments, several noteworthy observations come to the fore. The Boston Police Department (BPD) stands out with substantial total earnings, a reflection of its expansive array of responsibilities dedicated to upholding public safety. The Boston Fire Department (BFD) similarly accounts for a significant share of earnings, given its pivotal role in preventing and addressing fires and emergencies across the city.
Equally deserving of attention is the Boston Public Schools (BPS) Special Education department, which enjoys commendable funding. This financial support underscores the department’s crucial role in providing tailored assistance to students with disabilities, ensuring inclusivity and personalized educational support.

When it comes to individual job roles, teachers emerge at the forefront with the highest total earnings. This prominence can be attributed to the consistently high demand for teachers, recognizing their pivotal role in shaping the future through education. Teachers play a fundamental part in society, contributing significantly to the intellectual and emotional development of the next generation.
Following closely are police officers, entrusted with the paramount responsibility of safeguarding the public and upholding the law. The demanding nature of their profession, involving daily commitment to public safety and law enforcement, justifies their substantial compensation. This compensation serves as recognition for the sacrifices and risks that police officers undertake in the line of duty, exemplifying the essential nature of their contributions to community well-being.

November 14th, 2023, Project 3
The dataset we are using here is Employee Earnings data from 2018 to 2022 in Boston.
The dataset contains detailed information on the salaries of employees in various departments in Boston, covering a span of five years with a total of 114,531 records.
Features:
- Employee names: This field identifies the name of each employee included in the report.
- Job titles: This field specifies the job title of each employee.
- Departments: This field indicates the department to which each employee is assigned.
- Regular earnings (base salary): This field represents the base salary of each employee.
- Retroactive payments: This field includes any retroactive payments received by employees, such as adjustments to past salaries or bonuses.
- Other payments: This field encompasses a range of additional payments made to employees, such as stipends, allowances, and reimbursements.
- Overtime pay: This field indicates the amount of overtime pay earned by each employee.
- Injured pay: This field includes any compensation received by employees for work-related injuries.
- Detail work: This field indicates any earnings received by employees for construction detail work.
- Quinn education incentive: This field shows any payments received by employees under the Quinn education incentive program, which rewards employees for pursuing higher education.
- Total earnings: This field represents the total earnings of each employee, including all regular and additional payments.
Project Report – 2
Project 1 – Updated version
November 8th, KNN
KNN Algorithm
The K-nearest neighbors (KNN) algorithm is a non-parametric supervised machine learning algorithm that can be used for both classification and regression tasks. It works by finding the K most similar training data points to a new data point and then using the labels of those K neighbors to predict the label of the new data point.
- Load the data
- Initialize K to your chosen number of neighbors
- For each example in the data, calculate the distance between the query example and the current example, and add the distance and the index of the example to a collection
- Sort the collection of distances and indices in ascending order by distance
- Pick the first K entries from the sorted collection
- Get the labels of the selected K entries
- If regression, return the mean of the K labels
- If classification, return the mode of the K labels
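The steps above can be sketched directly in plain Python. This is a minimal illustration, not an optimized implementation: the function name `knn_predict`, the `task` parameter, the choice of Euclidean distance, and the toy data points are all assumptions made for the example.

```python
import math
from statistics import mean, mode


def knn_predict(data, query, k, task="classification"):
    """Predict the label of `query` from the k nearest examples in `data`.

    `data` is a list of (features, label) pairs; features are numeric tuples.
    """
    distances = []
    for index, (features, _label) in enumerate(data):
        # Euclidean distance between the query and the current example
        distances.append((math.dist(features, query), index))

    distances.sort()          # ascending by distance
    nearest = distances[:k]   # first K entries
    labels = [data[index][1] for _dist, index in nearest]

    if task == "regression":
        return mean(labels)   # average of the K labels
    return mode(labels)       # most common of the K labels


# Toy classification data (made up for illustration).
points = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((1.0, 0.5), "a"),
    ((3.0, 4.0), "b"), ((5.0, 7.0), "b"), ((3.5, 5.0), "b"),
]
print(knn_predict(points, (1.2, 1.0), k=3))

# Toy regression data (made up for illustration).
values = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_predict(values, (2.1,), k=3, task="regression"))
```

Because every prediction scans the whole dataset, this brute-force version is only practical for small data; libraries typically speed up the neighbor search with tree or index structures.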
November 6th, ANOVA Test
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences between the means of two or more groups or treatments. It is often used to determine whether there are any statistically significant differences between the means of different groups.
ANOVA compares the variation between group means to the variation within the groups. If the variation between group means is significantly larger than the variation within groups, it suggests a significant difference between the means of the groups.
ANOVA works by partitioning the total variance in the data into two components: the variance between the groups and the variance within the groups. The variance between the groups is calculated by comparing the group means to the overall mean. The variance within the groups is calculated by measuring the variability of the data points within each group.
If the variance between the groups is significantly larger than the variance within the groups, then we can conclude that there is a statistically significant difference between the group means. This means that the independent variable has a significant effect on the dependent variable.
For example, if an ANOVA comparing crop yield across three groups is statistically significant, we can conclude that at least one group's mean yield differs from the others. We can then use post-hoc tests to determine which groups are significantly different from each other.
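The partition into between-group and within-group variance can be written out in a few lines of plain Python. This is a sketch for illustration only: the function name `one_way_anova_f` and the three groups of numbers are made up, and the code stops at the F statistic rather than computing a p-value.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means compared to the overall mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: data points compared to their own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Three made-up groups of measurements.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

A large F value means the group means vary more than the spread within groups would explain. Turning F into a p-value requires the F distribution with these two degrees of freedom; in practice a library routine such as `scipy.stats.f_oneway` returns both the statistic and the p-value.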
