On this page, I present my analysis of textual reviews of kebab restaurants scraped from Google Maps. The work covers setting up the scraping pipeline, preprocessing the data, extracting useful numerical representations, and training neural-network-based models to recognize the sentiment of the reviews. Each review was converted into a set of linguistic representations, from classical ones (e.g. TF-IDF) to modern Transformer-based embeddings, plus numerical features. I then trained multiple configurations of neural-network-based models to predict the review rating on a scale of 1 to 5. The results gave me meaningful insight into the data and allowed me to draw conclusions about the applicability of different NLP techniques to sentiment analysis.
This project involves a systematic approach to ML engineering and includes concepts such as data versioning, experiment tracking, slice-based evaluation and thoughtful hyperparameter tuning.
To scrape the Google Maps reviews for a chosen set of restaurants, I needed an automated pipeline able to interact with the service the way a human would. This is crucial because it is the only way to leverage Google's internal algorithms for restaurant lookup, as well as loading, sorting, and automatically translating the reviews. To this end, I used Microsoft's Playwright package, which allows automated interaction with a browser engine.
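The restaurant-lookup step boils down to opening a Maps search URL and letting Google's own ranking do the work. A minimal sketch of that step (the helper name is hypothetical; the real pipeline then drives the resulting page with Playwright):

```python
from urllib.parse import quote_plus

def maps_search_url(query: str) -> str:
    """Build a Google Maps search URL for a free-text restaurant query.

    Hypothetical helper: the actual pipeline navigates to a URL of this
    shape in a Playwright-controlled browser and scrolls the result list
    to load reviews.
    """
    return f"https://www.google.com/maps/search/{quote_plus(query)}"

url = maps_search_url("kebab Warszawa")
```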
The EDA involved analysing the scraped raw data in order to understand its structure, distribution, and potential insights. The distribution of various features was examined with respect to data categories, such as whether the review was originally in English.
| Stat name | Train set | Test set |
|---|---|---|
| Num. of reviews | 218,083 | 11,559 |
| Num. of reviews written by author (min, max, avg) | (1, 998, 38) | (1, 920, 35) |
| Num. of primary locations | 56 | 2 |
| Num. of secondary locations | 170 | 10 |
| Num. of restaurants | 1,294 | 73 |
| Num. of reviews per restaurant (min, max, avg) | (1, 300, 168) | (11, 295, 158) |
| Num. of restaurants per primary location (min, max, avg) | (1, 160, 23) | (5, 68, 36) |
| Num. of non-English reviews | 183,495 | 10,827 |
| Categorized option name | Num. of occurrences | Num. of unique values |
|---|---|---|
| Service | 103,987 | 7 |
| Price per person | 101,229 | 12 |
| Meal type | 80,875 | 5 |
Processing the data involved two steps. First, the processor modules were fit on the training data and used to produce the numerical features; basic statistics, such as the mean and standard deviation, were then calculated on the extracted features. Second, the saved statistics and processor states were used to process the test dataset. This two-step process ensures that there is no data leakage between the training and test sets.
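The fit-then-reuse scheme can be sketched as follows (a hypothetical standardizing processor; the real processor modules also cover text features):

```python
class StandardizingProcessor:
    """Sketch of the two-step scheme: fit statistics on the training
    split only, then reuse the frozen state for the test split."""

    def fit(self, values):
        # Statistics are computed from the training data alone.
        self.mean = sum(values) / len(values)
        var = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = var ** 0.5 or 1.0
        return self

    def transform(self, values):
        # Uses the stats frozen at fit time -- no test-set leakage.
        return [(v - self.mean) / self.std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
proc = StandardizingProcessor().fit(train)
test_features = proc.transform([2.5, 5.0])  # frozen train-set stats
```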
The following classes of features were extracted from the raw data.
| Feature name | Feature value |
|---|---|
| Used BERT model | microsoft/deberta-v2-xxlarge |
| Chunk and step sizes for trace features | (3, 1), (5, 2), (5, 5), (7, 3) |
| Categorized options used | Meal type, Service, Price per person |
| Supported values for "Meal type" | Lunch, Dinner, Other, Brunch, Breakfast |
| Supported values for "Service" | Dine in, Take out, Delivery |
| Supported values for "Price per person" | zł 1-20, zł 20-40, zł 40-60, zł 60-80, zł 80-100 |
| Quantization ranges for number of reviews written by the author | 0-10, 11-50, 51-100, 101-175, 176+ |
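The quantization of the author's review count from the table above can be sketched as a simple bucket lookup (the function name and string labels are illustrative; the real processor may encode buckets as indices):

```python
# Bucket upper bounds from the table above; the last bucket is open-ended.
AUTHOR_REVIEW_BUCKETS = [(10, "0-10"), (50, "11-50"),
                         (100, "51-100"), (175, "101-175")]

def quantize_author_reviews(count: int) -> str:
    """Map the number of reviews written by an author to its bucket label."""
    for upper, label in AUTHOR_REVIEW_BUCKETS:
        if count <= upper:
            return label
    return "176+"
```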
The ultimate goal of this work is to develop a neural network capable of predicting the rating of a given review. The model is designed to use a chosen subset of the data features and encode them with dedicated encoders. Depending on the nature of the input data, the model uses either Linguistic Encoders, which consist of a stack of fully-connected layers with ReLU activations, normalization, and dropout, or Numerical Encoders, which leverage Kolmogorov-Arnold networks; the latter take longer to train but are more expressive.
All encoded features are subsequently concatenated and passed through a post-net that produces two kinds of rating prediction output: a regression output and a classification output, either of which can be used to train the model. Optionally, classifying the review's original language can be used as a secondary task.
The following metrics are used during the training (the regression output is always rounded and clamped to produce a single rating class):
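The rounding-and-clamping step mentioned in parentheses can be sketched as (hypothetical function name):

```python
def regression_to_class(pred: float, lo: int = 1, hi: int = 5) -> int:
    """Convert a raw regression output into a rating class by rounding
    to the nearest integer and clamping to the valid 1-5 range, so the
    same classification metrics apply to both output heads."""
    return max(lo, min(hi, round(pred)))
```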
The experiments involved training different setups of the model. A single study within an experiment was treated as standard hyperparameter tuning, used to obtain the best model for a given set of scientific parameters; an experiment could then compare the best models of several studies to gain insight into the relationship between a given parameter and the model's performance.
Due to the imperfect nature of the review rating task, no algorithm can achieve 100% accuracy, as the rating is in practice often strongly subjective. This is reflected in the level of confusion between neighboring classes. For this reason, I focused on evaluating the model's ability to predict the coarse "bad" and "good" classes, which I believe is a more realistic and meaningful evaluation for this type of task.
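The coarse grouping can be sketched as below; the cutoff (3 and below counts as "bad") is an assumption for illustration, and the experiments may draw the line differently:

```python
def coarse_class(rating: int) -> str:
    """Collapse a 1-5 rating into a coarse sentiment class.
    The threshold used here is an illustrative assumption."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return "bad" if rating <= 3 else "good"
```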
The ablation studies were conducted to evaluate the contribution of each component in the model. For each study, the best model was chosen from a set of trained configurations. The optimized hyperparameters within each study included regularization and optimizer parameters.
| Setup name | Description |
|---|---|
| Baseline | The best model obtained in the initial experiments. It uses the secondary task of translation-flag prediction. It also includes all categorical input features and the best-performing trace features. The linguistic representations include sentence-level BERT embeddings and a top-7000 TF-IDF vector. |
| No numerical features | The model does not use the numerical encoder for trace and numerical features. |
| No classical features | The model does not include either numerical or classical linguistic features, such as POS tags and word count-based vectors. |
| Only classical features | The model uses only the classical linguistic features, without BERT embeddings or numerical features. |
| No BERT sequences | The only feature used by the model is review-level BERT embedding. |
The evaluation was conducted on the test set, which is independent of the training data. This separation aims to reflect testing the model in production: the models are optimized on the validation dataset and evaluated on independent data.
| Setup name | Avg. Precision | Avg. Recall |
|---|---|---|
| Baseline | 0.930 | 0.944 |
| No numerical features | 0.924 | 0.941 |
| No classical features | 0.921 | 0.934 |
| Only classical features | 0.915 | 0.936 |
| No BERT sequences | 0.870 | 0.913 |
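The averaged precision and recall above can be derived from per-class counts; a minimal sketch, assuming macro-averaging over the coarse classes (hypothetical helper, example counts are made up):

```python
def macro_precision_recall(counts):
    """Macro-average precision and recall from per-class
    (true positive, false positive, false negative) counts."""
    precisions, recalls = [], []
    for tp, fp, fn in counts.values():
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n

# Illustrative counts only: {"class": (tp, fp, fn)}
p, r = macro_precision_recall({"bad": (80, 20, 10), "good": (900, 50, 60)})
```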
Slice-based evaluation aims to provide insight into the model's performance on chosen categories of the data. This type of evaluation exposes potential weaknesses of the model, e.g. poor performance on scarcely represented groups.
| Original language | (Precision, Recall) for "Bad" class | (Precision, Recall) for "Good" class |
|---|---|---|
| English | (0.68, 0.90) | (0.98, 0.94) |
| Non-English | (0.89, 0.94) | (0.97, 0.94) |
| Num. of reviews written by author | Precision | Recall |
|---|---|---|
| Up to 10 | 0.949 | 0.964 |
| 10 to 50 | 0.925 | 0.937 |
| 50 to 100 | 0.884 | 0.901 |
| 100 to 175 | 0.895 | 0.912 |
| 175+ | 0.848 | 0.865 |