Housing Auto Pricing (Part 2) - Data Preprocessing
Data Preprocessing
Yesterday I successfully scraped the dataset. For the machine learning algorithm to run smoothly later, though, the data needs to be preprocessed first.
Before preprocessing, let me show you my database structure.
Even at a quick glance, we can see that many of the data formats are inconsistent, and some fields show “No Data”.
Let’s standardize each column!
house_title
This column ends with “** 平米” (square meters), which duplicates the house_area column, so we need to remove the “平米” text. The same pass also pulls the room type out of the title and stores it in the rooms_type column.
Since this involves string manipulation, we’ll do it in Python rather than in MySQL.
    def house_title_format(self):
        # One cursor for reading, one for writing
        cursor_r = self.db.cursor()
        cursor_w = self.db.cursor()
        sqlR = "SELECT house_id, house_title FROM house_info;"
        sqlW = "UPDATE house_info SET house_title = %s, rooms_type = %s WHERE house_id = %s;"
        cursor_r.execute(sqlR)
        for house_id, house_title in cursor_r.fetchall():
            # Skip NULL titles and titles that are already clean
            if house_title and "平米" in house_title:
                # Extract room type: everything up to and including "室"
                rooms_type = house_title.split("室")[0] + "室"
                # Clean title: remove the "平米" text
                house_title = house_title.replace("平米", "").strip()
                cursor_w.execute(sqlW, (house_title, rooms_type, house_id))
        self.db.commit()
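The string handling can also be pulled out of the database loop so it can be tested on its own. The following is only a minimal sketch: it assumes titles look like "万科城 2室1厅 89平米" (a layout not confirmed by the scraped data), and `split_title` is a hypothetical helper that is not part of the original class. Unlike the method above, it drops the number together with the unit and keeps only the "N室" part as the room type.

```python
import re

# Assumed layout: the area, if present, sits at the very end of the title,
# e.g. "万科城 2室1厅 89平米". Adjust both patterns if real titles differ.
AREA_SUFFIX = re.compile(r"\s*\d+(?:\.\d+)?\s*平米\s*$")
ROOMS = re.compile(r"\d+室")


def split_title(title):
    """Return (clean_title, rooms_type); rooms_type is None if no "N室" is found."""
    # Remove a trailing "<number>平米" together with surrounding whitespace
    clean = AREA_SUFFIX.sub("", title).strip()
    # Take the first "<digits>室" occurrence as the room type
    match = ROOMS.search(title)
    rooms_type = match.group(0) if match else None
    return clean, rooms_type
```

Returning None when no room count is found avoids writing a made-up room type for titles that do not follow the assumed layout.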
house_area
The “平米” (square meters) suffix needs to be removed so that the value can be converted to a numeric type.
    def house_area_format(self):
        cursor = self.db.cursor()
        sql = "SELECT house_id, house_area FROM house_info;"
        cursor.execute(sql)
        # fetchall() loads every row up front, so the cursor can be reused for the UPDATEs
        for house_id, house_area in cursor.fetchall():
            # Skip NULL values and values that are already clean
            if house_area and "平米" in house_area:
                # Strip the unit, leaving only the number as text
                house_area = house_area.replace("平米", "").strip()
                cursor.execute("UPDATE house_info SET house_area = %s WHERE house_id = %s",
                               (house_area, house_id))
        self.db.commit()
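The method above strips the unit but still writes a string back to the table. One possible way to do the numeric conversion itself is sketched below; `parse_area` is a hypothetical helper, and whether the house_area column can actually hold a number depends on the table definition, which is not shown here.

```python
def parse_area(raw, unit="平米"):
    """Convert a string such as "89平米" to a float, or return None if that fails."""
    if not raw:
        # None or empty string: nothing to convert
        return None
    text = raw.replace(unit, "").strip()
    try:
        return float(text)
    except ValueError:
        # Not a plain number (for example a placeholder); treat it as missing
        return None
```

Returning None instead of raising keeps one malformed row from stopping the whole pass; those rows can then be handled together with the other missing values.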
Summary
Data preprocessing is crucial for machine learning. The main tasks are:
- Remove duplicate information
- Convert to numeric types
- Handle missing values
- Standardize formats
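For the missing-value item, a simple approach is to map placeholder strings to None before writing a row back, since Python MySQL drivers generally send None as NULL. This is a sketch under assumptions: `normalize_missing` is a hypothetical helper, and the marker strings are examples that should be replaced with whatever placeholders actually appear in the scraped data.

```python
# Hypothetical placeholder strings; replace with the ones found in the real data.
MISSING_MARKERS = frozenset({"No Data", ""})


def normalize_missing(value, markers=MISSING_MARKERS):
    """Return None for placeholder strings, otherwise return the value unchanged."""
    if value is None:
        return None
    # Compare after trimming so that "  " and " No Data " also count as missing
    if isinstance(value, str) and value.strip() in markers:
        return None
    return value
```

For example, `normalize_missing("No Data")` gives None, while a real value such as `"89"` is passed through untouched.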
Next, we’ll build the machine learning model!