Wrangling with that Data! 3/4
This series shows how a CSV file can be cleaned using the pandas, numpy, re and
pyjanitor (imported under the name janitor) libraries. Some outputs are
shortened for readability.
Links To All Parts Of The Series
Data Preparation Series 1
Data Preparation Series 2
Data Preparation Series 3
Data Preparation Series 4
Summary Of This Article
A valid gps column is created for the records by joining the longitude and
latitude columns using the Point geometry object from the shapely.geometry
library. This lays the foundation for assigning geospatial features to the
listings that are completely independent of the dataset. These features prove
significant for the prediction of the variable ‘base_rent’ in later stages of
the process. Further, a column of dtype timedelta64[ns] is created from the
datetime64[ns] columns ‘date_listed’ and ‘date_unlisted’ to calculate how long
a listing was listed on the platform ‘immoscout24.de’.
Summary Of The Series
- A DataFrame with 47 columns is given as input.
- Dimensionality reduction is performed on the columns to filter out and keep only the relevant ones.
- The pyjanitor module and its method chaining syntax are used extensively to speed up the cleaning procedure.
- The unique values of each column form the basis for the steps needed to clean it.
- Regular expressions (regex) are mostly used to extract the cell contents that hold the valid data.
- Regex is also used to replace invalid character patterns with valid ones.
- After cleaning, the values are validated using regex patterns.
- New timedelta64[ns] time_listed and Point geometry gps columns are created.
Heating Costs
Looking at df.heating_costs.value_counts(), we see that the heating costs are
often included in the auxiliary costs. For 4047 rows, around
$\frac{1}{3}$ of all rows, the column only states that heating costs are
included in the auxiliary costs, without any numerical value. On average,
heating costs should be about the same for listings that share one of the major
heating types and similar insulation, normalized by the living area of a
listing (given in $m^{2}$). Hamburg is close to the North Sea and its winters
are therefore generally mild. There certainly is no continental climate, where
heating makes up a higher percentage of the total rent.
The column is therefore dropped.
```python
df.heating_costs.value_counts().head(10)
```
```
in Nebenkosten enthalten          4047
nicht in Nebenkosten enthalten    1070
keine Angabe                       731
50 €                               248
60 €                               220
70 €                               206
80 €                               199
100 €                              199
inkl. 50 €                         150
90 €                               144
Name: heating_costs, dtype: int64
```
```python
df.drop(labels=["heating_costs"], axis=1, inplace=True)
```
Latitude
This column, together with the Longitude column, is one of the most important in the dataset. Together they give the exact GPS coordinates for most of the listings. This spatial information will be joined with external geospatial information that is independent of the dataset. Together these will lay the foundation for the most influential features used to train the XGBoost and Lasso Regression models.
```python
df.lat.value_counts().sample(10, random_state=seed)
```
```
['lat: 53.4859709952484,']      1
['lat: 53.555415063775214,']    1
['lat: 53.59607281949776,']     1
['lat: 53.575903316512345,']    1
['lat: 53.56514913880538,']     1
['lat: 53.56863155760176,']     3
['lat: 53.45789899804567,']     1
['lat: 53.60452600011545,']     1
['lat: 53.619085739824705,']    1
['lat: 53.54586549494192,']     5
Name: lat, dtype: int64
```
Cleaning Of Longitude And Latitude Columns
A look at a sample of the values in the Latitude column shows that they most
likely all follow the same pattern and can therefore easily be converted to
floating point numbers. Generally, the json_ columns are much easier to clean
than the values from the visible features on the URL of a listing. For the
following cases, the pandas.Series.str.extract() function is the tool of
choice. There is no difference in the number of unique values before and after
the cleaning for either column.
```python
print(df.lat.nunique(), df.lng.nunique())
df = (
    df.process_text(
        column_name="lat",
        string_function="extract",
        pat=r"[^.]+?(\d{1,2}\.{1}\d{4,})",
        expand=False,
    )
    .process_text(
        column_name="lng",
        string_function="extract",
        pat=r"[^.]+?(\d{1,2}\.{1}\d{4,})",
        expand=False,
    )
    .change_type(column_name="lat", dtype="float64", ignore_exception=False)
    .change_type(column_name="lng", dtype="float64", ignore_exception=False)
)
print(df.lat.nunique(), df.lng.nunique())
```
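As a minimal standalone sketch, the same extraction pattern can be applied to a single raw cell value with the re module (the sample value is taken from the lat output above):

```python
import re

# One raw cell value from the lat column, as shown above
raw = "['lat: 53.4859709952484,']"

# Same pattern as in the process_text() call: lazily skip the prefix,
# then capture 1-2 digits, a dot, and at least four decimal places
match = re.search(r"[^.]+?(\d{1,2}\.\d{4,})", raw)
print(float(match.group(1)))  # 53.4859709952484
```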
Validation of Longitude and Latitude Columns
Only the missing values in the two columns fail the validation pattern. This
was expected, since we did not add .dropna() to the list comprehensions over
the values in the columns.
```python
bb = [x for x in df.lat.unique().tolist() if not re.match(r"\A\d+\.\d{4,}\Z", str(x))]
ba = [x for x in df.lng.unique().tolist() if not re.match(r"\A\d+\.\d{4,}\Z", str(x))]
print(ba)
print('\n')
print(bb)
```
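To see why only the missing values fail, the validation pattern can be tested on individual values; note that str(np.nan) is the string "nan", which contains no digits:

```python
import re

pat = r"\A\d+\.\d{4,}\Z"
print(bool(re.match(pat, "53.4859709952484")))  # True: valid coordinate
print(bool(re.match(pat, "nan")))               # False: str(np.nan)
print(bool(re.match(pat, "53.12")))             # False: fewer than 4 decimals
```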
Creating The GPS Column
To utilize the Latitude and Longitude values in the dataset, we need to create
a tuple from the two variables. For all valid pairs, the result should look
like this: (lng, lat). For a tuple of this form to be recognized as a valid GPS
value, we need to be able to apply the .apply(Point) method to all values
without errors during the application.
We start by checking for problematic rows. Rows that need attention are the
following:
- Rows that have only one valid value in the subset df[['lng','lat']]
- Rows with missing values in both columns ‘lng’ and ‘lat’
The output shows that, for the two columns, every row either has missing values
in both or valid values in both. With this information, we can save the index
values of the rows with missing latitude and longitude, so that in the next
steps we can drop the rows with missing values in the gps column without much
effort.
```python
df[["lng", "lat"]].isna().value_counts()
```
```
lng    lat
False  False    9423
True   True     2901
dtype: int64
```
The method above is preferable to the one below, which creates a list for each
of the columns lng and lat holding the index values of the rows with NaN
values. The assert statement is then used to confirm that the two lists are
equal element-wise.
```python
lngna = list(df[df["lng"].isna()].index)
latna = list(df[df["lat"].isna()].index)
# Both columns are missing in exactly the same rows
assert lngna == latna
```
Creation Of The GPS Column 1/3
The lng and lat values of each row are combined using the tuple function
inside a DataFrame.apply() call, and the result is assigned to the new column
gps.
```python
from shapely.geometry import Point
```
```python
df["gps"] = df[["lng", "lat"]].apply(tuple, axis=1)
```
We use the row index values from earlier. This makes it easy to drop the
missing values, given that a tuple of missing values can no longer be dropped
using df.dropna(subset=['gps'], inplace=True). Such a tuple is not itself
np.nan; it merely holds the values (np.nan, np.nan) inside the tuple.
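A small sketch with toy data (not the actual dataset) illustrates why such tuples survive dropna():

```python
import numpy as np
import pandas as pd

# Toy frame: one row with a (nan, nan) tuple, one with valid coordinates
toy = pd.DataFrame({"gps": [(np.nan, np.nan), (10.0, 53.5)]})

# The tuple object itself is not NaN, so isna() reports False for both rows
print(toy["gps"].isna().tolist())       # [False, False]
print(len(toy.dropna(subset=["gps"])))  # 2 -> nothing gets dropped
```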
```python
df.drop(lngna, inplace=True, axis=0)
```
Drop Rows That Contain NaN Values
We drop the rows that have NaN values in the lng and lat columns. Around 3000
rows were dropped as a result.
No more NaN values in all three columns, as expected.
```python
df[["lng", "lat", "gps"]].isna().value_counts()
```
```
lng    lat    gps
False  False  False    9423
dtype: int64
```
Creation Of The GPS Column 2/3
Only valid values are left for the longitude and latitude variables, and therefore only valid values make up the data in the gps column.
We check what the values in the gps column look like before the POINT conversion.
```python
df["gps"].sample(1000, random_state=seed)
```
```
4555      (10.081875491322997, 53.59659576773955)
5357      (10.117258625619199, 53.57200940640784)
5463      (10.076499662582167, 53.60477899815401)
4191       (9.948503601122527, 53.58407791510109)
7238        (9.944410649308205, 53.5642271656938)
                           ...
10138    (10.024189216380647, 53.588549633689425)
3479       (9.859813703847099, 53.53408132036414)
5558        (10.03611960198576, 53.6124394982734)
11072      (10.009835277290465, 53.5488715679383)
4786      (9.944531180915247, 53.577945758643175)
Name: gps, Length: 1000, dtype: object
```
Conversion Of The gps Column To Geometry Object 3/3
The gps column is ready for conversion by applying the Point function to all values in the column. The result needs no further processing.
```python
df["gps"] = df["gps"].apply(Point)
```
Getting A Quick Look At The Finished GPS Column
The gps column looks as expected, and we can see that its dtype is correct as well:
Name: gps, dtype: object <class 'shapely.geometry.point.Point'>
```python
print(df["gps"][0:10], type(df["gps"][10]))
```
```
0     POINT (10.152357145948395 53.52943934146104)
1      POINT (10.06842724545814 53.58414266878421)
2       POINT (9.86423115803761 53.6044918821393)
3      POINT (9.956881419437474 53.56394970119381)
4     POINT (10.081257248271337 53.60180649336605)
5    POINT (10.032298671585043 53.561192834080046)
6    POINT (10.204297951920761 53.493636048600344)
7      POINT (9.973693895665868 53.56978432240341)
8      POINT (9.952844267614122 53.57611822321049)
9     POINT (10.036249068267281 53.56479904983683)
Name: gps, dtype: object <class 'shapely.geometry.point.Point'>
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9423 entries, 0 to 12323
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   gps     9423 non-null   object
dtypes: object(1)
memory usage: 405.3+ KB
```
Reviewing The Columns In The DataFrame Again
We drop more columns after the creation of the gps column. street_number
(street and house number of a listing) is not needed anymore, since all
listings now have a valid pair of longitude and latitude coordinates associated
with them. floor_space is a variable with only 827 non-missing values; around
91.2% of its rows are missing. This share of missing values is too large for
the column to be imputed in this case. The column is therefore dropped.
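The missing-value share quoted above can be verified quickly from the counts in the text (827 non-missing floor_space values out of the 9423 remaining rows):

```python
# Counts taken from the text: 827 non-missing floor_space values,
# 9423 rows remaining after the GPS cleaning
non_missing = 827
total = 9423
missing_share = (total - non_missing) / total
print(round(missing_share * 100, 1))  # 91.2
```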
```python
df.drop(columns=["street_number", "floor_space"], inplace=True)
```
Date Listed & Date Unlisted Columns
For the columns that indicate when a listing was published and unpublished on
the rental platform, the needed cleaning steps are identical. Therefore, not
all steps are explicitly described for both columns. The values need to be
converted to datetime, so the entire column can be assigned dtype
datetime64[ns].
```python
df.date_listed.unique()[
    0:10
]  # The entire list of unique values was used for the following steps.
```
```
array(['[\'exposeOnlineSince: "03.12.2018"\']',
       '[\'exposeOnlineSince: "02.12.2018"\']',
       '[\'exposeOnlineSince: "30.11.2018"\']',
       '[\'exposeOnlineSince: "29.11.2018"\']',
       '[\'exposeOnlineSince: "28.11.2018"\']',
       '[\'exposeOnlineSince: "27.11.2018"\']',
       '[\'exposeOnlineSince: "26.11.2018"\']',
       '[\'exposeOnlineSince: "25.11.2018"\']',
       '[\'exposeOnlineSince: "24.11.2018"\']',
       '[\'exposeOnlineSince: "23.11.2018"\']'], dtype=object)
```
- For both columns, we extract the valid data, which has the form dd.mm.yyyy.
- The pandas dtype datetime64[ns] goes down to the nanosecond level, while the data only goes down to the day level. We therefore truncate everything below the day level.
- Missing values are filled with the next valid entry (a backward fill). The time a listing was online does not vary much, except for a few outliers, as will be discussed later.
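The extract, convert, and fill steps can be sketched with plain pandas on a few toy values (pyjanitor's fill_direction="up" corresponds to a backward fill here):

```python
import pandas as pd

# Toy values mimicking the raw date_listed cells, with one gap
raw = pd.Series([
    "['exposeOnlineSince: \"03.12.2018\"']",
    None,
    "['exposeOnlineSince: \"30.11.2018\"']",
])

# Extract dd.mm.yyyy, parse it day-first, then backward-fill the gap
dates = pd.to_datetime(
    raw.str.extract(r"(\d{2}\.\d{2}\.\d{4})", expand=False),
    format="%d.%m.%Y",
).bfill()
print(dates.tolist())
```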
```python
df = (
    df.process_text(
        column_name="date_listed",
        string_function="extract",
        pat=r"(\d{2}\.\d{2}\.\d{4})",
        expand=False,
    )
    .process_text(
        column_name="date_unlisted",
        string_function="extract",
        pat=r"(\d{2}\.\d{2}\.\d{4})",
        expand=False,
    )
    .to_datetime("date_listed", errors="raise", dayfirst=True)
    .to_datetime("date_unlisted", errors="raise", dayfirst=True)
    .fill_direction(date_listed="up", date_unlisted="up")
    .truncate_datetime_dataframe(datepart="day")
)
```
Exploring Data After Cleaning
```python
ppr = df[["date_listed", "date_unlisted"]][0:10]
print(ppr)
```
```
  date_listed date_unlisted
0  2018-12-03    2018-12-03
1  2018-12-03    2018-12-03
2  2018-12-03    2018-12-03
3  2018-12-03    2018-12-03
4  2018-12-03    2018-12-03
5  2018-12-02    2018-12-02
6  2018-11-30    2018-12-03
7  2018-11-30    2018-12-03
8  2018-11-30    2018-12-03
9  2018-12-03    2018-12-03
```
The time listed is calculated as the difference between the date_unlisted and
date_listed columns. The result will become the time_listed column, which has
dtype timedelta64[ns]. We truncate the timedelta values in this column to show
only the days a listing was online, since our data only includes the day a
listing was listed or unlisted.
```python
tg = df["date_unlisted"] - df["date_listed"]
```
Looking at several metrics for the timedelta64[ns] column, we see negative
values for a couple of rows. Looking at the corresponding values of the
date_listed and date_unlisted columns, we see that they are most likely in the
wrong order. After this inspection, the negative values are dealt with by
taking the absolute value of all timedelta64[ns] values. This corrects only the
negative deltas while leaving the positive ones unaltered.
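That the absolute value flips only the negative deltas can be checked on two toy timedeltas, using the same operation as the transform_column lambda below:

```python
import numpy as np
import pandas as pd

# Two toy deltas: one swapped pair (negative), one valid duration
deltas = pd.Series(pd.to_timedelta(["-2 days", "5 days"]))

# Same operation as the transform_column lambda used later:
# the negative delta is flipped, the positive one is unchanged
fixed = deltas.transform(lambda x: np.abs(x))
print(fixed.tolist())
```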
```python
print(tg.min())
print(tg.max())
print(tg.mean())
print(tg.median())
print(tg.quantile())
print(tg[tg.dt.days < 0].index.tolist())
indv = tg[tg.dt.days < 0].index.tolist()
df.loc[indv, ["date_listed", "date_unlisted"]]
```
```
-648 days +00:00:00
924 days 00:00:00
19 days 05:10:04.011461318
5 days 00:00:00
5 days 00:00:00
[41, 578, 919, 1161, 1218, 1581, 2652, 2682, 2869, 2959, 3150, 3629, 3686, 6833, 7543, 7777, 7794, 11283, 11570, 11795, 11829, 11842, 11965, 12023]
```
```
      date_listed date_unlisted
41     2018-11-29    2018-11-28
578    2018-10-26    2018-10-25
919    2018-10-08    2018-10-07
1161   2018-09-24    2018-09-23
1218   2018-09-20    2018-09-19
1581   2018-11-01    2018-09-03
2652   2018-09-21    2018-09-20
2682   2018-09-03    2018-07-08
2869   2018-08-08    2018-06-28
2959   2018-08-08    2018-06-22
3150   2018-10-08    2018-06-13
3629   2018-06-15    2018-05-15
3686   2018-07-12    2018-06-04
6833   2018-11-23    2017-12-12
7543   2017-08-24    2017-08-23
7777   2017-09-21    2017-09-08
7794   2018-01-24    2017-08-08
11283  2018-08-17    2017-01-26
11570  2018-09-21    2016-12-12
11795  2018-09-10    2017-07-07
11829  2018-09-10    2017-09-22
11842  2018-10-01    2017-07-28
11965  2018-06-06    2017-10-30
12023  2018-06-21    2017-03-22
```
Time_Listed Column
The time_listed column is created and added to the DataFrame.
```python
df = (
    df.add_column(
        column_name="time_listed", value=df["date_unlisted"] - df["date_listed"]
    )
    # .change_type(column_name="time_listed", dtype="pd.Timedelta", ignore_exception=False)
    .transform_column(column_name="time_listed", function=lambda x: np.abs(x))
)
```
The minimum is 0 days, as it should be, and no missing data was introduced by
the cleaning steps.
```python
print(df["time_listed"].min())
print(df["time_listed"].isna().value_counts())
```
```
0 days 00:00:00
False    9423
Name: time_listed, dtype: int64
```
Dtype of time_listed
The dtype of the new column time_listed is timedelta64[ns], as it should be.
```python
df[['time_listed']].info()
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9423 entries, 0 to 12323
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   time_listed  9423 non-null   timedelta64[ns]
dtypes: timedelta64[ns](1)
memory usage: 405.3 KB
```
Validation Of time_listed
df.describe() is used to show the distribution of time_listed and to verify
that the earlier subtraction of date_listed from date_unlisted resulted in
valid timedelta64[ns] values in the time_listed column. It becomes clear that
there must be outliers in the time_listed data, as the mean is 20 days while
the median is 5 days; the mean is far more sensitive to outliers than the
median, by the way it is calculated.
```python
df[['time_listed']].describe()
```
```
                       time_listed
count                         9423
mean    20 days 01:41:28.252148997
std     46 days 13:04:26.285141156
min                0 days 00:00:00
25%                1 days 00:00:00
50%                5 days 00:00:00
75%               21 days 00:00:00
max              924 days 00:00:00
```
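The gap between mean and median can be reproduced with a toy sample (not the actual data); a single large outlier drags the mean up while the median barely moves:

```python
import numpy as np

# Toy listing durations in days, with one extreme outlier
days = np.array([1, 2, 5, 5, 6, 900])

print(np.mean(days))    # pulled far above the typical value by the outlier
print(np.median(days))  # 5.0: robust to the outlier
```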
Column pc_city_quarter is dropped, since the gps column makes it redundant.
```python
df.drop(columns=["pc_city_quarter"], inplace=True)
```