Advanced Data Mining Project - data scraping, analysis and processing; model training and evaluation

On this page, I present my analysis of textual reviews of kebab restaurants, scraped from Google Maps. I aimed to perform thorough experiments involving setting up the scraping pipeline, preprocessing the data, extracting useful numerical representations and training neural network-based models to recognize the sentiment of the reviews. Each review was converted into a set of linguistic representations, ranging from classical ones (e.g. TF-IDF) to modern Transformer-based embeddings, as well as numerical features. I subsequently trained multiple configurations of neural network-based models to predict the review rating on a scale of 1 to 5. The results allowed me to gain meaningful insight into the data and draw conclusions about the applicability of different NLP techniques for sentiment analysis.

This project involves a systematic approach to ML engineering and includes concepts such as data versioning, experiment tracking, slice-based evaluation and thoughtful hyperparameter tuning.

Wiktor Prosowicz AGH University of Krakow wprosowicz@student.agh.edu.pl
Code

1. Data scraping & EDA

In order to scrape the Google Maps reviews for a chosen set of restaurants, I needed an automated pipeline able to interact with the service the way a human would. This is crucial, since only in this way can I leverage Google's internal algorithms for restaurant lookup, as well as for loading, sorting and automatically translating the reviews. To this end, I used Microsoft's Playwright package, which makes it possible to interact with a browser engine programmatically.
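A minimal sketch of such a pipeline, assuming Playwright's synchronous API. All selectors and the `scrape_reviews` helper are hypothetical placeholders; the real Google Maps markup differs and changes often, so this only illustrates the interaction pattern:

```python
# Illustrative only: every selector below is a hypothetical placeholder.
try:
    from playwright.sync_api import sync_playwright
except ImportError:  # keep the sketch importable when Playwright is absent
    sync_playwright = None

def scrape_reviews(query: str, max_reviews: int = 300) -> list[dict]:
    reviews: list[dict] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.google.com/maps")
        page.fill("input#searchboxinput", query)      # placeholder selector
        page.keyboard.press("Enter")
        page.wait_for_selector("div.review-card")     # placeholder selector
        while len(reviews) < max_reviews:
            # Collect only the cards that appeared since the last pass.
            for card in page.query_selector_all("div.review-card")[len(reviews):]:
                reviews.append({"text": card.inner_text(),
                                "rating": card.get_attribute("aria-label")})
            page.mouse.wheel(0, 4000)                 # trigger lazy loading
            page.wait_for_timeout(1000)
        browser.close()
    return reviews
```

Scrolling the pane rather than paginating mirrors how a human loads more reviews, which is what lets the service's own loading and translation logic kick in.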

A screenshot of raw reviews scraped for an example location. The algorithm first found the restaurant for a given location specification and then scraped reviews up to a certain threshold. For each review, the algorithm extracted information about the author (specifically, the total number of reviews they have written), the review's content (possibly including the original language), the rating on a scale of 1 to 5 and an optional list of pre-defined options, such as the meal type.

The EDA involved analysing the scraped raw data in order to understand its structure, distribution, and potential insights. The distribution of various features was examined with respect to data categories, such as whether the review was originally in English.

The overall stats describing the train and test data sets.
Stat name Train set Test set
Num. of reviews 218,083 11,559
Num. of reviews written by author (min, max, avg) (1, 998, 38) (1, 920, 35)
Num. of primary locations 56 2
Num. of secondary locations 170 10
Num. of restaurants 1,294 73
Num. of reviews per restaurant (min, max, avg) (1, 300, 168) (11, 295, 158)
Num. of restaurants per primary location (min, max, avg) (1, 160, 23) (5, 68, 36)
Num. of non-English reviews 183,495 10,827
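Statistics like those in the table above can be computed straightforwardly with pandas; the column names and toy data below are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

# Toy review table with assumed column names.
df = pd.DataFrame({
    "restaurant": ["A", "A", "B", "B", "B"],
    "author_review_count": [1, 12, 5, 998, 38],
    "original_language": ["pl", "en", "pl", "pl", "en"],
    "rating": [5, 4, 1, 3, 5],
})

per_author = df["author_review_count"]
stats = {
    "num_reviews": len(df),
    "author_reviews_min_max_avg": (per_author.min(), per_author.max(),
                                   round(per_author.mean())),
    "num_restaurants": df["restaurant"].nunique(),
    "num_non_english": (df["original_language"] != "en").sum(),
}
```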

Insight into authors' review writing patterns

Distribution of the number of reviews written by the author with respect to the given rating. Outliers have been removed using the IQR method. It is visible that the average number of written reviews is higher for ratings 2-4, which aligns with the intuition that a significant portion of users do not write reviews often, except when they want to give an extremely positive or negative one.
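A sketch of the IQR-based outlier filtering, assuming the conventional 1.5x multiplier:

```python
import numpy as np

def remove_iqr_outliers(values: np.ndarray) -> np.ndarray:
    """Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values >= lo) & (values <= hi)]

# A power-user with 500 reviews is dropped as an outlier.
counts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 500])
filtered = remove_iqr_outliers(counts)
```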

Analysis of the distributions w.r.t. the location size

Distributions of the number of reviews written by the author and the average restaurant rating with respect to the location size. Since the number of scraped restaurants wasn't necessarily consistent with the real population size, the location size is rather an indicator of the depth of the algorithmic search in a specific area. For this reason, the more restaurants were scraped for a given primary location, the more likely it is that the algorithm collected not only the restaurants recommended by Google Maps, but also those that might otherwise have been missed. This is consistent with the concentration of high average restaurant ratings in areas with a smaller number of restaurants (the promoted places span a large portion of all restaurants scraped for such a location).
Scatter plot showing the relationship between restaurant ratings and location size. The points form a triangular structure, with a significantly higher number of low-rated restaurants in the "bigger" locations.
Scatter plot showing the relationship between the proportion of non-English reviews in a location and its size.

Analysis of the distributions w.r.t. the review original language

Distribution of review ratings for translated and non-translated reviews. It is visible that the proportion of 5-star ratings is higher in English reviews at the expense of the rest of the ratings.
The distributions of review lengths for each rating category for non-English and English reviews. It is visible that the average review length is higher in non-English reviews for all rating classes. This suggests that tourists tend to write shorter reviews.

Analysis of the categorical options

Top 3 most common categorized options present in the train dataset reviews. The number of reviews with no categorized options is 96,885.
Categorized option name Num. of occurrences Num. of unique values
Service 103,987 7
Price per person 101,229 12
Meal type 80,875 5
Distributions of value occurrences for two example categorized options. The left figure shows a "good" distribution, with a relatively low number of unique values and a high total number of occurrences. The right figure depicts a "bad" option, i.e. one with an extremely high number of unique values and a low total number of occurrences.

2. Data Processing & Post-processing Analysis

Processing the data involved two steps. First, the processor modules were fit on the training data and used to produce the numerical features; basic statistics, such as the mean and standard deviation, were then calculated on the extracted features. Second, the produced statistics, together with the saved processor states, were used to process the test dataset. This two-step procedure ensures that there is no data leakage between the training and test sets.
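The two-step flow can be sketched with scikit-learn-style processors; the TF-IDF vectorizer and scaler below stand in for the project's actual processor modules:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

train_texts = ["great kebab", "terrible service", "great service"]
test_texts = ["great kebab, terrible place"]

# Step 1: fit processors and statistics on the training data only.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)
scaler = StandardScaler(with_mean=False).fit(X_train)
X_train = scaler.transform(X_train)

# Step 2: reuse the saved state to transform the test data,
# so no information from the test split leaks into the processors.
X_test = scaler.transform(tfidf.transform(test_texts))
```

Words unseen during fitting (here "place") are simply ignored at test time, which is exactly the behavior a saved-state pipeline guarantees.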

The following classes of features were extracted from the raw data.

  1. BERT-based embeddings
    • sentence-level embeddings obtained from the [CLS] BERT's special token
    • available in two modes: a sequence of sentence embeddings or a single averaged embedding for the review
  2. Word count-based features
    • based on classical text processing techniques involving counting word occurrences
    • available as tensors containing features for the top-k most common words to control model size
    • used by the model in several possible modes: raw count, normalized count, TF-IDF and Binary Bag-of-Words
  3. POS-based features
    • a vector with part-of-speech tag frequencies
    • available in raw count and normalized count forms
  4. Categorical features
    • extracted based on a chosen subset of categorized options' values
    • an additional feature: quantized number of reviews written by the author
  5. Trace features
    • (velocity, volume) pairs inspired by this paper
    • calculated using spatial relationships between chunks created from token-level BERT embeddings using a given chunk size and step size
    • velocity indicates the pace with which a given review jumps from topic to topic
    • volume indicates the semantical richness of the review
    Illustration of the velocity and volume features.
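One plausible reading of the trace features, assuming velocity is the mean distance between consecutive chunk centroids and volume is the mean distance of the centroids from their overall mean (the exact definitions follow the cited paper; the chunk/step pair (5, 2) matches one of the configurations listed below):

```python
import numpy as np

def chunk_embeddings(token_embs: np.ndarray, size: int, step: int) -> np.ndarray:
    """Average token embeddings inside sliding windows of (size, step)."""
    starts = range(0, len(token_embs) - size + 1, step)
    return np.stack([token_embs[s:s + size].mean(axis=0) for s in starts])

def trace_features(token_embs: np.ndarray, size: int, step: int) -> tuple[float, float]:
    chunks = chunk_embeddings(token_embs, size, step)
    # Velocity: how fast the review moves between topics.
    velocity = float(np.linalg.norm(np.diff(chunks, axis=0), axis=1).mean())
    # Volume: how widely the review's chunks spread in embedding space.
    volume = float(np.linalg.norm(chunks - chunks.mean(axis=0), axis=1).mean())
    return velocity, volume

rng = np.random.default_rng(0)
vel, vol = trace_features(rng.normal(size=(20, 8)), size=5, step=2)
```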

Processing details

Hyperparameters used to obtain the train and test datasets.
Feature name Feature value
Used BERT model microsoft/deberta-v2-xxlarge
Chunk and step sizes for trace features (3, 1), (5, 2), (5, 5), (7, 3)
Categorized options used Meal type, Service, Price per person
Supported values for "Meal type" Lunch, Dinner, Other, Brunch, Breakfast
Supported values for "Service" Dine in, Take out, Delivery
Supported values for "Price per person" zł 1-20, zł 20-40, zł 40-60, zł 60-80, zł 80-100
Quantization ranges for number of reviews written by the author 0-10, 11-50, 51-100, 101-175, 176+
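The quantization of the author's review count into the ranges above can be sketched with `pandas.cut`, assuming right-inclusive bin edges:

```python
import pandas as pd

# Bin edges matching the quantization ranges from the table above.
bins = [0, 10, 50, 100, 175, float("inf")]
labels = ["0-10", "11-50", "51-100", "101-175", "176+"]

counts = pd.Series([3, 11, 76, 175, 998])
quantized = pd.cut(counts, bins=bins, labels=labels, include_lowest=True)
```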

Analysis of the processed dataset

Distributions of two example categorical features w.r.t. to the original language of the review. The left figure shows the absolute numerical distribution and the right shows the percentage.
Scatter plot with trace features.
Distribution of the document frequency of the words in the dataset.

3. Model architecture & metrics

The ultimate goal of this work is to develop a neural network capable of predicting the rating of a given review. The model is designed to use a chosen subset of the data features and encode them using dedicated encoders. Depending on the nature of the input data, the model uses either Linguistic Encoders, which consist of a stack of fully connected layers with ReLU activations, normalization and dropout, or Numerical Encoders, which leverage Kolmogorov-Arnold networks; the latter take longer to train but are more expressive.

All encoded features are subsequently concatenated and passed through a post-net that produces two kinds of rating prediction output: a regressive rating prediction and a rating classification, either of which can be used to train the model. Optionally, classification of the review's original language can be used as a secondary task.
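A structural sketch of this design in PyTorch. The dimensions are assumptions, and a plain fully-connected encoder stands in for every feature branch (the KAN-based numerical encoder is not reproduced here), so this is an illustration of the wiring rather than the actual implementation:

```python
import torch
import torch.nn as nn

class LinguisticEncoder(nn.Module):
    """FC stack with ReLU, normalization and dropout, as described above."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(),
            nn.LayerNorm(out_dim), nn.Dropout(0.1),
        )

    def forward(self, x):
        return self.net(x)

class RatingPredictor(nn.Module):
    def __init__(self, bert_dim=768, tfidf_dim=7000, hidden=128):
        super().__init__()
        self.bert_enc = LinguisticEncoder(bert_dim, hidden)
        self.tfidf_enc = LinguisticEncoder(tfidf_dim, hidden)
        post_in = 2 * hidden
        self.regression_head = nn.Linear(post_in, 1)  # scalar rating
        self.class_head = nn.Linear(post_in, 5)       # rating classes 1..5
        self.lang_head = nn.Linear(post_in, 2)        # secondary task

    def forward(self, bert_emb, tfidf_vec):
        # Encode each feature group, concatenate, then branch into heads.
        h = torch.cat([self.bert_enc(bert_emb), self.tfidf_enc(tfidf_vec)], dim=-1)
        return self.regression_head(h), self.class_head(h), self.lang_head(h)

model = RatingPredictor()
reg, cls, lang = model(torch.randn(4, 768), torch.randn(4, 7000))
```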

The neural architecture of the rating predictor.

The following metrics are used during the training (the regression output is always rounded and clamped to produce a single rating class):

  1. Accuracy
  2. Precision, Recall - calculated with respect to each class and averaged in either the macro or weighted setup.
  3. Accuracy, Precision and Recall for translation flag prediction task
  4. Area Under the ROC Curve for translation flag prediction task
  5. Metrics for coarse prediction - precision and recall calculated for the coarse-grained rating prediction, where ratings 1 and 2 are treated as the "bad" class and 3-5 as the "good" class.
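The rounding-and-clamping step and the coarse "bad"/"good" mapping can be sketched as:

```python
import numpy as np

def to_rating_class(reg_out: np.ndarray) -> np.ndarray:
    """Round the regression output and clamp it to the valid range 1..5."""
    return np.clip(np.rint(reg_out), 1, 5).astype(int)

def to_coarse(ratings: np.ndarray) -> np.ndarray:
    """Map ratings 1-2 to the "bad" class (0) and 3-5 to "good" (1)."""
    return (ratings >= 3).astype(int)

preds = to_rating_class(np.array([0.4, 2.6, 4.7, 6.1]))  # -> [1, 3, 5, 5]
coarse = to_coarse(preds)                                # -> [0, 1, 1, 1]
```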

4. Experiments

The experiments involved training different setups of the model. A single study was treated as a standard hyperparameter tuning run, used to obtain the best model for a given set of scientific parameters. Each experiment could then compare the best models of several studies to gain insight into the relationship between a certain parameter and the model's performance.

Since review ratings are in practice often strongly subjective, no algorithm can achieve 100% accuracy on this task. This is reflected in the level of confusion between predictions of neighboring classes. For this reason, I focused on evaluating the model's ability to predict the coarse "bad" and "good" classes, which I believe to be a more realistic and meaningful evaluation for this type of task.

Confusion matrix of rating class prediction for an example model's configuration.

Obtaining the baseline model

The distribution of the model's precision for the "bad" class. The precision drops when the translation flag prediction loss increases, because the model focuses on optimizing the secondary loss.
The recall of predicting the "bad" class for two different studies. It is clear that including the secondary task improves the model's performance.
Distributions of the loss value with respect to the used learning rate. The figures come from two different studies and therefore have different scales. However, it is still visible that exploring the space of possible learning rates makes it possible to find an optimal configuration.

Ablation studies

The ablation studies were conducted to evaluate the contribution of each component in the model. For each study, the best model was chosen from a set of trained configurations. The optimized hyperparameters within each study included regularization and optimizer parameters.

Configurations of the ablation studies.
Setup name Description
Baseline The best model obtained in the initial experiments. It uses the secondary task of translation flag prediction. It also includes all input categorical features and the best performing trace features. The linguistic representations include sentence-level BERT embeddings and a top-7000 TF-IDF vector.
No numerical features The model does not use the numerical encoder for trace and numerical features.
No classical features The model includes neither the numerical features nor the classical linguistic features, such as POS tags and word count-based vectors.
Only classical features The model uses only the classical linguistic features, without BERT embeddings or numerical features.
No BERT sequences The only feature used by the model is review-level BERT embedding.

5. Evaluation

The evaluation was conducted on the test set, which is independent from the training data. This separation reflects testing the model in production: the models are optimized on the validation dataset and evaluated on independent data.

Evaluation results for all considered studies.
Setup name Avg. Precision Avg. Recall
Baseline 0.930 0.944
No numerical features 0.924 0.941
No classical features 0.921 0.934
Only classical features 0.915 0.936
No BERT sequences 0.870 0.913

Slice-based evaluation

Slice-based evaluation aims to give insight into the model's performance on chosen categories of the data. This type of evaluation helps reveal the model's possible weaknesses, e.g. poor performance on sparsely represented groups.
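A minimal sketch of slice-based evaluation on the coarse classes, grouping a toy prediction table by the original-language flag:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Toy coarse predictions (1 = "good", 0 = "bad") with a slicing column.
df = pd.DataFrame({
    "y_true":     [0, 1, 1, 0, 1, 1, 0, 1],
    "y_pred":     [0, 1, 1, 1, 1, 0, 0, 1],
    "is_english": [True, True, True, True, False, False, False, False],
})

# Compute precision/recall separately on each slice.
results = {}
for slice_name, group in df.groupby("is_english"):
    results[slice_name] = (
        precision_score(group["y_true"], group["y_pred"]),
        recall_score(group["y_true"], group["y_pred"]),
    )
```

The same pattern extends to any categorical slice, such as the quantized author review count used below.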

Evaluation results for English and Non-English reviews, obtained for the baseline model. It should be noted that the model's performance on the positive reviews is high for both categories, whereas there's a difference in the performance for the negative ones.
Original language (Precision, Recall) for "Bad" class (Precision, Recall) for "Good" class
English (0.68, 0.90) (0.98, 0.94)
Non-English (0.89, 0.94) (0.97, 0.94)
Evaluation results for different quantization groups of the number of reviews written by authors, obtained for the baseline model.
Category Precision Recall
0-10 0.949 0.964
11-50 0.925 0.937
51-100 0.884 0.901
101-175 0.895 0.912
176+ 0.848 0.865