Sentiment analysis determines the polarity of a sentence, document, or review. It summarizes a steady stream of feedback by classifying each text as positive or negative. Roman Urdu, which is Urdu written in the Latin script, is commonly used in Asian and Middle Eastern countries. This project performs binary classification of Roman Urdu reviews. The existing annotated data is in English; it is converted to Roman Urdu using multiple tools, and the resulting labeled data is divided into training and test sets. Feature extraction and representation methods transform the training data into a sparse matrix, which machine learning techniques then use to predict the labels of unseen test samples. The results are analyzed with respect to the feature vectorization and selection methods used.
Work in sentiment analysis has been limited to a few languages. Sentiment analysis in Roman Urdu helps capture the general opinion in countries across Asia and the Middle East. The labeled Roman Urdu dataset used for classification is the first of its kind to be publicly accessible, and it will help future researchers in their social analytics work. Linear SVC achieves the best accuracy, at 96%. Our proposed methodology can be studied and later improved to get better results.
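The pipeline described above (tokenize, map each review to a sparse feature vector, train a linear classifier, predict on unseen reviews) can be sketched in a few lines of standard-library Python. This is a minimal illustration only, not the project's code: the toy reviews are hypothetical and are not taken from the dataset, the helper names are made up for this example, and a simple perceptron stands in for the linear SVC reported above so that the sketch needs no external libraries.

```python
from collections import Counter

# Hypothetical toy reviews (1 = positive, 0 = negative); NOT from the project's dataset.
train = [
    ("yeh film bohat achi hai", 1),
    ("khana bohat acha tha", 1),
    ("service bohat buri thi", 0),
    ("yeh phone bekar hai", 0),
]
test = [("film achi thi", 1), ("yeh service buri hai", 0)]


def build_vocab(texts):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Sparse bag-of-words row {column index: count}; unseen tokens are dropped."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    return {vocab[tok]: n for tok, n in counts.items()}


def score(row, w, b):
    """Linear decision value for one sparse row."""
    return b + sum(w[j] * v for j, v in row.items())


def train_perceptron(rows, labels, n_features, epochs=10):
    """Perceptron stand-in for a linear classifier (assumption: not the project's linear SVC)."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            target = 1 if label == 1 else -1
            if target * score(row, w, b) <= 0:  # misclassified: nudge the weights
                for j, v in row.items():
                    w[j] += target * v
                b += target
    return w, b


vocab = build_vocab(text for text, _ in train)
rows = [vectorize(text, vocab) for text, _ in train]
labels = [label for _, label in train]
w, b = train_perceptron(rows, labels, len(vocab))

predictions = [1 if score(vectorize(text, vocab), w, b) > 0 else 0 for text, _ in test]
accuracy = sum(p == y for p, y in zip(predictions, (y for _, y in test))) / len(test)
print(predictions, accuracy)
```

Storing each review as a dictionary of non-zero counts mirrors the sparse matrix mentioned above: most reviews use only a handful of the vocabulary's tokens, so only those entries are kept. In practice, a library vectorizer and a linear SVC would replace the hand-written pieces here.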
The Dataset
The dataset can be downloaded here