Tobias Klein

Get In Touch

Prefer using email? Say hi at kle.tobias@googlemail.com

README.md

The README.md File! Read it.


The README.md file gives information about the posts I have written and links to all of them, as well as links to the projects, which are available in PDF format. It briefly describes the contents of each post, to make it easier for readers to find what they are looking for.

Special Projects

Bachelor Thesis
Hyperparameter Optimisation
For Real Estate Prediction Models

Figure: Heat map of the base rent per square meter across the federal state of Hamburg, Germany.


A PDF version of the entire Bachelor Thesis (98 pages) can be accessed and downloaded here: Klein - Real Estate Prediction Models


Abstract


English Version


This work proposes combining a highly scalable and customisable process with very accurate prediction results from machine learning models. The customisation is guided by the information the user seeks to gain from the process, which makes it applicable to a variety of sectors, such as Banking & Finance, Marketing and urban development, among others. The work evaluates a process that uses self-acquired data from an online real estate platform, obtained by deploying a custom web scraping algorithm. This data is then combined with several spatial features to predict the base rent for apartments on a validation dataset. The analysis and predictions are made for rental apartment listings within the Hanseatic City of Hamburg. The spatial features originate from sources other than that of the apartment data and therefore have to be adapted to it first. Predictions are made using state-of-the-art machine learning models, in the form of a Lasso Regression model and an XGBoost Regressor model. The hyperparameter optimisation techniques grid search and random search are compared during the optimisation process. The focus is on maximising the prediction accuracy of the models. The best scores, expressed in RMSE, are 190.68 for the Lasso and 115.39 for the XGBoost Regressor. Differences in complexity and interpretability between the models are discussed, and, in connection with this, the strengths and weaknesses of the respective models are pointed out.



German Version


Die Kombination eines hoch skalierbaren und anpassbaren Prozesses mit sehr präzisen Vorhersageergebnissen unter Verwendung von Machine Learning Modellen ist der vorgeschlagene Ansatz dieser Arbeit. Die Anpassung richtet sich danach, welche Informationen der Benutzer aus dem Prozess gewinnen möchte. Dadurch ist der Prozess für eine Vielzahl von Sektoren anwendbar, wie zum Beispiel Bankwesen & Finanzen, Marketing und Stadtentwicklung. Es bewertet den Prozess der Verwendung von selbst gewonnenen Daten, die durch den Einsatz eines eigenen Web-Scraping-Algorithmus von einer Online-Immobilienplattform gewonnen wurden. Diese Daten werden dann mit mehreren räumlichen Merkmalen kombiniert, um die Kaltmiete für Wohnungen auf einem Validierungsdatensatz vorherzusagen. Die Analysen und Prognosen werden für Mietwohnungsangebote in der Hansestadt Hamburg erstellt. Die räumlichen Merkmale stammen aus anderen Quellen als denen der Wohnungsdaten und müssen daher zunächst an diese angepasst werden. Die Vorhersagen werden mit Hilfe modernster Machine Learning Modelle in Form eines Lasso-Regressionsmodells und eines XGBoost Regressor-Modells getroffen. Die Hyperparameter-Optimierungstechniken Grid Search und Random Search werden während des Optimierungsprozesses verglichen. Der Fokus liegt auf der Maximierung der Vorhersagegenauigkeit der Modelle. Die besten Ergebnisse, ausgedrückt in RMSE, sind 190,68 für das Lasso und 115,39 für den XGBoost Regressor. Unterschiede in der Komplexität und Interpretierbarkeit zwischen den Modellen werden diskutiert und damit verbunden die Stärken und Schwächen des jeweiligen Modells aufgezeigt.
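
To give a flavour of the comparison described in the abstract, here is a minimal, hypothetical sketch of grid search versus random search for a Lasso model using scikit-learn. It is not the code from the thesis; the data, parameter ranges and settings are placeholders.

```python
# Minimal sketch (not thesis code): comparing grid search and random search
# for tuning a Lasso regression, scored by RMSE as in the thesis.
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Placeholder data standing in for the prepared apartment listings.
X, y = make_regression(n_samples=1000, n_features=20, noise=50.0, random_state=0)

lasso = Lasso(max_iter=10_000)

# Grid search: exhaustively evaluates every point on a fixed grid.
grid = GridSearchCV(
    lasso,
    param_grid={"alpha": np.logspace(-3, 2, 20)},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)

# Random search: samples a fixed budget of candidates from a distribution.
rand = RandomizedSearchCV(
    lasso,
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=5,
    random_state=0,
)
rand.fit(X, y)

# Scores are negative RMSE, so flip the sign for reporting.
print("grid search   best RMSE:", -grid.best_score_)
print("random search best RMSE:", -rand.best_score_)
```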



TOC

  Abstract

1. Introduction

    1.1 Scope & Research Questions

    1.2 Outline

2. Related Work & Foundation

    2.1 Related Work

    2.2 Fundamentals

3. Methods & Data

    3.1 Specification of the Problem

    3.2 Model Selection

    3.3 Collection of Data

    3.4 Preparation of Data

    3.5 Spatial Joins

    3.6 Overview - All Variables

4. Exploratory Data Analysis

    4.1 Univariate Distributions

    4.2 Correlation

    4.3 Heat Map of Base Rent

    4.4 Preprocessing

5. Machine Learning

    5.1 Hyperparameter Optimisation

    5.2 Overview - RMSE Values

    5.3 Lasso - Interpretation of Coefficients

    5.4 XGBoost - Feature Importance

6. Evaluation

    6.1 Comparison of Results

    6.2 Hyperparameter Optimisation - Results

    6.3 Limitations & Future Work

7. Discussion & Conclusion

    7.1 Dataset Construction

    7.2 Hyperparameter Optimisation - Discussion

8. Summary

    8.1 Review

    8.2 Applicability of the Process

  Bibliography

  Annex

  List of Figures

  List of Tables

  Listings



Blog Posts & Projects

Tags and links to the blog posts and project posts I have published on this site.

General Information

The following table explains certain notations used in the descriptions of the articles. They follow the general notation found in comparable articles.

Notation         Reference
df               pandas.DataFrame object
pd               Alias for the pandas module
df.some_method   Alias for pandas.DataFrame.some_method

Blog Posts


Multicollinearity: What It Is & Measures To Spot It.

Description: Two highly correlated distributions ($S1$, $S2$) are created from scratch. A scatter plot of the two distributions is created, with values from $S1$ on the x-axis and values from $S2$ on the y-axis, to show what collinearity looks like. Covariance, along with the prerequisites that need to be met for it to be a suitable metric, is first explored as a measure to spot collinearity. The mathematical formula and a shortcut for calculating it are presented. The second metric is the Pearson Correlation Coefficient. It is described much like the covariance, along with a description of the possible values of the correlation coefficient $r$. The article concludes with an example of how the Pearson Correlation Coefficient can be used in a hypothesis test to determine whether $S1$ and $S2$ are correlated with each other at all.

Tags: multicollinearity, covariance, Pearson correlation coefficient, creation of two distributions that are tested for correlation, hypothesis test to gauge if the two distributions are at all correlated with each other.
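
To illustrate the two measures and the hypothesis test covered in the post, here is a minimal, hypothetical sketch using generated data (not the post's actual $S1$ and $S2$):

```python
# Minimal sketch (hypothetical data): covariance and the Pearson
# correlation coefficient for two correlated samples, plus the
# accompanying hypothesis test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
s1 = rng.normal(loc=0.0, scale=1.0, size=500)
s2 = 0.8 * s1 + rng.normal(loc=0.0, scale=0.5, size=500)  # correlated with s1

# Sample covariance (off-diagonal element of the 2x2 covariance matrix).
cov = np.cov(s1, s2)[0, 1]

# Pearson r and the p-value of the two-sided test of H0: "no linear correlation".
r, p_value = stats.pearsonr(s1, s2)

print(f"covariance: {cov:.3f}")
print(f"Pearson r:  {r:.3f}, p-value: {p_value:.2e}")
```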

Using Pandas To Import CSV Data Into MySQL

Description: The problem this article addresses is presented, along with the solution it offers. A step-by-step guide follows, including the required Python libraries, showing how to import the CSV data into a pandas DataFrame object and how to export the DataFrame to a MySQL database as a MySQL table. The author has used this method several times to create a MySQL table with over $4 \times 10^{7}$ rows.

Tags: MySQL, Python3 script, utility, pandas, DataFrame, pd.read_csv, sqlalchemy, sqlalchemy.create_engine, sqlalchemy.types: Integer, Float, Text, DateTime, pd.df.to_sql
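
A minimal sketch of the workflow described above. The connection string, file name, table name, column names and column types are placeholders, not the values used in the post:

```python
# Minimal sketch (placeholder credentials, file and column names):
# importing CSV data with pandas and writing it to a MySQL table.
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import DateTime, Float, Integer, Text

df = pd.read_csv("listings.csv", parse_dates=["date_listed"])

# Connection string is a placeholder; requires a MySQL driver such as pymysql.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")

df.to_sql(
    name="listings",          # target table
    con=engine,
    if_exists="replace",      # or "append" for incremental loads
    index=False,
    dtype={                   # explicit MySQL column types
        "id": Integer(),
        "base_rent": Float(),
        "description": Text(),
        "date_listed": DateTime(),
    },
    chunksize=10_000,         # write in batches for large files
)
```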


Projects


Data Preparation Series Part 4: Exploring Tabular Data With Pandas

Description: A showcase of what batch processing several columns of tabular data with pandas, pyjanitor and the re library can look like. Redundant columns are dropped and the remaining columns are reordered by type. Columns with dtype categorical are created and their classes are converted to numerical values for the subsequent evaluation of candidate models.

Tags: pandas, pyjanitor, method chaining, cleaning of tabular dataset, DataFrame, data validation using regex patterns, df.process_text, df.find_replace, df.fill_empty, df.rename_column, df.loc, df.drop, df.to_csv, df.info, df.reorder_columns, df.factorize_columns, df.filter
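
A minimal sketch of what such a method chain could look like. The column names are hypothetical and the pyjanitor method names follow the tags above; exact signatures may differ between pyjanitor versions:

```python
# Minimal sketch (hypothetical column names): batch processing columns
# with pandas and pyjanitor method chaining.
import pandas as pd
import janitor  # registers the pyjanitor methods on DataFrame

df = pd.read_csv("listings_clean.csv")

df = (
    df
    .drop(columns=["redundant_col_1", "redundant_col_2"])            # drop redundant columns
    .reorder_columns(["base_rent", "square_meters", "district"])     # put key columns first
    .factorize_columns(column_names=["district", "heating_type"])    # categorical classes -> integer codes
)

# A plain-pandas alternative for encoding a single categorical column:
df["district_code"] = df["district"].astype("category").cat.codes
```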

Data Preparation Series Part 3: Creation Of Point Geometry Column gps + timedelta64ns Column

Description: Creation of a valid gps column for the records by joining the longitude and latitude columns using the geometry object Point from the library shapely.geometry. This lays the foundation for assigning geospatial features, which are completely independent of the dataset, to the listings; features that prove significant for the prediction of the variable ‘base_rent’ in the later stages of the process. Furthermore, a timedelta64[ns] column is created from the datetime64[ns] columns ‘date_listed’ and ‘date_unlisted’ to calculate how long a listing was listed on the platform ‘immoscout24.de’.

Tags: pandas, pyjanitor, method chaining, cleaning of tabular dataset, DataFrame, data validation using regex patterns, df.drop, df.process_text, df.change_type, df.dropna, df.to_datetime, df.fill_direction, df.truncate_datetime_dataframe, df.add_column, df.transform_column, df.describe, shapely.geometry.Point
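
A minimal sketch of the two steps described above, using made-up records rather than the post's dataset:

```python
# Minimal sketch (hypothetical records): building a Point geometry column
# from longitude/latitude and a listing-duration column from two
# datetime64[ns] columns.
import pandas as pd
from shapely.geometry import Point

df = pd.DataFrame(
    {
        "longitude": [9.9937, 10.0153],
        "latitude": [53.5511, 53.5413],
        "date_listed": ["2020-01-03", "2020-02-10"],
        "date_unlisted": ["2020-01-20", "2020-03-01"],
    }
)

# Join longitude and latitude into a single shapely Point per record.
df["gps"] = [Point(lon, lat) for lon, lat in zip(df["longitude"], df["latitude"])]

# Duration a listing was online: subtracting two datetime64[ns] columns
# yields a timedelta64[ns] column.
df["date_listed"] = pd.to_datetime(df["date_listed"])
df["date_unlisted"] = pd.to_datetime(df["date_unlisted"])
df["time_listed"] = df["date_unlisted"] - df["date_listed"]
```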

Data Preparation Series Part 2: Foundation For Cleaning 47 Column Wide Tabular Data With Janitor & Pandas

Description: The DataFrame is explored, and its columns are processed and cleaned to give insight into how many values are missing, among other properties.

Tags: pandas, pyjanitor, method chaining, cleaning of tabular dataset, DataFrame, data validation using regex patterns, df.rename_column, df.clean_names, df.remove_empty, df.drop, df.fill_empty, df.find_replace, df.value_counts, df.process_text, df.change_type, string_functions: extract, replace, lstrip, rstrip, regular expressions
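
A minimal sketch of such a first cleaning pass. Column names and fill values are hypothetical, and pyjanitor signatures may differ between versions:

```python
# Minimal sketch (hypothetical column names and values): first cleaning
# pass over a wide tabular dataset with pandas and pyjanitor.
import pandas as pd
import janitor  # registers the pyjanitor methods on DataFrame

df = pd.read_csv("listings_raw.csv")

df = (
    df
    .clean_names()                                      # lowercase, snake_case column names
    .remove_empty()                                     # drop rows/columns that are entirely empty
    .rename_column("kaltmiete", "base_rent")            # give a column a clearer name
    .fill_empty(column_names=["balcony"], value="no")   # fill missing values in one column
)

# Quick look at how many values are missing per column.
print(df.isna().sum().sort_values(ascending=False).head(10))
```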

Data Preparation Series Part 1: Exploring The 47 Column Wide Tabular Dataset Using Pandas

Description: An overview of the data exploration tools available in the pandas library.

Tags: df.head, df.tail, df.index.max, df.columns, df.index, df.shape, df.count, df.describe, df.to_markdown, df.nunique, df.filter, df.sample, df.value_counts, df.drop_duplicates
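
A minimal sketch of such a first look, using a hypothetical file and column name:

```python
# Minimal sketch (hypothetical file and column names): a first look at a
# wide tabular dataset using pandas exploration tools.
import pandas as pd

df = pd.read_csv("listings_raw.csv")

print(df.shape)          # (rows, columns)
print(df.head())         # first five rows
print(df.columns)        # column labels
print(df.describe())     # summary statistics for numeric columns
print(df.nunique())      # number of distinct values per column
print(df.sample(5))      # random sample of rows
print(df["district"].value_counts().head())  # most frequent categories in one column
```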



Chapter 5: Section 13

Description: The systemctl command.

Tags: systemctl, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 12

Description: Processes, jobs and scheduling.

Tags: application, script, daemon, threads, job, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 10 & 11

Description: Overview and comparison of popular directory services.

Tags: directory services, active directory service windows, identity manager red hat, winbind samba linux to windows, openldap red hat, ibm directoryserver, jumpcloud, ldap protocol, date, uptime, uname, which, cal, bc, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 8 & 9

Description: Communicating with users & Linux account management.

Tags: users, wall, write, local account, domain account, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 7

Description: Monitoring users.

Tags: who, last, w, finger, id, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 6

Description: How to switch users and sudo access.

Tags: sudo, visudo, su - username, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 5

Description: chage command in depth.

Tags: chage, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 4

Description: User account management.

Tags: useradd, groupadd, userdel, groupdel, usermod [-G], shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 3

Description: The sed command for text manipulation.

Tags: sed command, shell, bash, centos 7.9, system administration, ssh, virtual machine

Chapter 5: Section 1 & 2

Description: Linux text editors.

Tags: vi, vim, nvim, emacs, shell, bash, centos 7.9, system administration, ssh, virtual machine


You are welcome to take a look and browse through some of my posts.

I follow two principles, in this order: a methodically clean and conscientious approach, followed by clear and aesthetically pleasing communication of information. In addition, I do my best to use the available tools efficiently and flexibly. Be it the commands provided by the CentOS (~Red Hat Linux) distribution for system administration, or the Python workflow that goes from reading raw data (from .csv files, custom web scraping algorithms, or directly from a database) to a production-ready predictive machine learning model that can be deployed via Docker or AWS and serve the client's purposes.

The aim is a balance between low deployment costs, achieved through virtualization and the use of scalable, on-demand cloud services that keep costs in check, and a competitive advantage for the customer through the use of the finished product.
