Data Preprocessing

Yesterday I finished scraping the dataset. Before any machine learning algorithm can run on it smoothly, though, the data needs preprocessing.

Before preprocessing, let me show you my database structure.

Even at a quick glance, we can see that many columns use inconsistent formats, and some values are just “No Data”.

Let’s standardize each column!


house_title

This column clearly duplicates the house_area column: each title ends with “** 平米” (** square meters). We need to remove that trailing area text.

Since it involves string slicing, we’ll do this in Python (not MySQL).

import re

def house_title_format(self):
    cursor_r = self.db.cursor()
    cursor_w = self.db.cursor()
    sqlR = "SELECT house_id, house_title FROM house_info;"
    sqlW = "UPDATE house_info SET house_title = %s, rooms_type = %s WHERE house_id = %s;"

    cursor_r.execute(sqlR)
    for house_id, house_title in cursor_r.fetchall():
        # Skip NULL titles and titles without an area fragment
        if house_title and "平米" in house_title:
            # Room type: everything up to and including the first "室" (room), if present
            rooms_type = house_title.split("室")[0] + "室" if "室" in house_title else None
            # Remove the area fragment (number plus "平米"), which duplicates house_area
            house_title = re.sub(r"\s*(?:\d+(?:\.\d+)?\s*)?平米", "", house_title).strip()
            cursor_w.execute(sqlW, (house_title, rooms_type, house_id))

    self.db.commit()
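
The title cleanup can also be tried out without touching the database. The sketch below is a minimal standalone version that assumes the area fragment looks like a number followed by 平米. The helper name clean_house_title and the sample titles are made up for illustration and are not taken from the scraped data.

```python
import re

# Matches an area fragment such as "60平米" or "89.5 平米" (the number is optional).
AREA_FRAGMENT = re.compile(r"\s*(?:\d+(?:\.\d+)?\s*)?平米")


def clean_house_title(title):
    """Return the title with any "<number>平米" fragment removed."""
    return AREA_FRAGMENT.sub("", title).strip()


# Hypothetical sample titles, only for trying the function out.
samples = [
    "整租·阳光小区 2室1厅 60平米",
    "合租·花园路 3室1厅 89.5 平米",
    "整租·无面积标题",
]
for sample in samples:
    print(repr(clean_house_title(sample)))
```

Keeping the string logic in a pure function like this makes it easy to check a few titles by hand before running the UPDATE over the whole table.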

house_area

The “square meters” suffix needs to be removed and the value converted to a number.

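def house_area_format(self):
    cursor = self.db.cursor()
    sql = "SELECT house_id, house_area FROM house_info;"
    sqlW = "UPDATE house_info SET house_area = %s WHERE house_id = %s;"

    cursor.execute(sql)
    # fetchall() loads every row first, so reusing the cursor for updates is safe
    for house_id, house_area in cursor.fetchall():
        if house_area and "平米" in house_area:
            try:
                # Strip the "平米" (square meters) suffix and parse the number
                area = float(house_area.replace("平米", "").strip())
            except ValueError:
                continue  # Leave values that are not plain numbers untouched
            cursor.execute(sqlW, (area, house_id))

    self.db.commit()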
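
The numeric conversion itself is easy to get subtly wrong, so here is a minimal sketch of it as a standalone function. The name parse_area is hypothetical, and the sketch assumes every area value is either a plain number followed by 平米 or something that should be treated as unparseable.

```python
def parse_area(raw):
    """Convert a string like "60平米" to a float; return None if it cannot be parsed."""
    if raw is None:
        return None
    # Drop the unit and any surrounding whitespace before parsing.
    text = str(raw).replace("平米", "").strip()
    try:
        return float(text)
    except ValueError:
        return None


# Made-up inputs, only for trying the function out.
for raw in ["60平米", " 89.5 平米 ", "No Data", None]:
    print(repr(raw), "->", parse_area(raw))
```

Note that this only produces a Python number. For the database to treat the values as numeric, the house_area column would also need a numeric type, which depends on how the table was created.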

Summary

Data preprocessing is crucial for machine learning. The main steps are:

  1. Remove duplicate information
  2. Convert to numeric types
  3. Handle missing values
  4. Standardize formats
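
Step 3 is not shown in the code above, so here is a minimal sketch of one way to handle it: mapping placeholder strings to None. It assumes the placeholder is stored literally as "No Data"; the other markers in the set, the helper name normalize_missing, and the sample row are assumptions for illustration.

```python
# Placeholder strings treated as missing. "No Data" is the one seen in the table;
# the empty string and "n/a" are assumptions.
MISSING_MARKERS = {"no data", "", "n/a"}


def normalize_missing(value):
    """Return None for placeholder values, otherwise the stripped string."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in MISSING_MARKERS else text


# Hypothetical row, only for trying the function out.
row = {"house_title": "整租·阳光小区 2室1厅", "house_area": "No Data"}
cleaned = {key: normalize_missing(val) for key, val in row.items()}
print(cleaned)
```

With a DB-API driver, passing None as a query parameter is normally written to the database as NULL, which is easier to filter or impute later than a text placeholder.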

Next, we’ll build the machine learning model!