Limited Dataset Size
The relatively small dataset (18,000 stories) restricts the generalizability of the findings to larger and more diverse corpora. Deep learning models, especially RNNs, typically benefit from substantially more training data.
Basic Word Embeddings
The study uses static Polyglot word embeddings and does not explore more advanced techniques such as contextualized embeddings (e.g., BERT, RoBERTa). Contextualized embeddings capture richer, context-dependent semantic information and could lead to improved performance.
Lack of Hyperparameter Tuning
The hyperparameters of the RNN models were not thoroughly tuned. Different architectures may require different optimal settings, and the default parameters used in the study may not have been ideal for either GRU or LSTM.
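To make the tuning point concrete, here is a minimal sketch of an exhaustive grid search over common RNN hyperparameters. The grid values and the `evaluate()` stub are purely illustrative assumptions, not settings from the study; a real version would train and validate a GRU or LSTM inside `evaluate()`.

```python
import itertools

# Hypothetical hyperparameter grid; values are illustrative only.
GRID = {
    "hidden_size": [64, 128, 256],
    "dropout": [0.0, 0.2, 0.5],
    "learning_rate": [1e-3, 1e-4],
}

def evaluate(config):
    """Stub standing in for training/validating an RNN with `config`.

    A real implementation would train the model and return validation
    accuracy; here we score configurations deterministically so the
    sketch is runnable on its own.
    """
    return (config["hidden_size"] / 256
            + (1 - config["dropout"])
            + config["learning_rate"] * 100)

def grid_search(grid):
    """Try every combination in `grid` and keep the best-scoring one."""
    best_config, best_score = None, float("-inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = grid_search(GRID)
print(best)
```

Because GRU and LSTM architectures can peak at different settings, running such a search separately per architecture avoids the risk that shared defaults favor one model over the other.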
Limited Evaluation Metrics
The evaluation focuses on accuracy, sensitivity, and specificity but does not consider other important metrics like precision, F1-score, or area under the ROC curve (AUC). A more comprehensive evaluation would provide a better understanding of the models' performance.
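As a stdlib-only sketch of the additional metrics mentioned above, the following computes precision, recall (sensitivity), specificity, and F1 from binary labels; the example labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Derive precision, recall, specificity, and F1 from the
    confusion-matrix counts of binary ground truth vs. predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # a.k.a. sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Illustrative labels only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```

Note that AUC additionally requires predicted probabilities or scores rather than hard labels, which is one reason it is often reported alongside (not instead of) threshold-based metrics like these.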
Domain-Specific Application
The study's focus on a specific psychological test (TAT/PSE) limits the broader applicability of the findings to other text classification tasks. It's unclear how well these results generalize to other domains.