Stock News Dataset Quality Assessment by Evaluating the Data Distribution and the Sentiment Prediction

  • Alasmari, Eman (The Faculty of Computing and Information Technology, King Abdulaziz University)
  • Hamdy, Mohamed (The Faculty of Computing and Information Technology, King Abdulaziz University)
  • Alyoubi, Khaled H. (The Faculty of Computing and Information Technology, King Abdulaziz University)
  • Alotaibi, Fahd Saleh (The Faculty of Computing and Information Technology, King Abdulaziz University)
  • Received: 2022.02.05
  • Published: 2022.02.28

Abstract

This work provides a reliable, classified stock dataset merged with Saudi stock news. The dataset allows researchers to analyze and better understand the relationships between stock news and stock fluctuations, and the impact of one on the other. The data were collected from the Saudi stock market via the Corporate News (CN) and Historical Data Stocks (HDS) datasets. As their names suggest, CN contains news, and HDS records how stock values change over time. Both datasets cover the period from 2011 to 2019, have 30,098 rows, and have 16 variables, four of which they share and 12 of which differ. The combined dataset presented here therefore includes 30,098 published news items together with information about stock fluctuations across nine years. Native Arabic speakers associated with the stock domain interpret stock news polarity in various ways, so the polarity was categorized manually based on Arabic semantics. Because the Saudi stock market contributes substantially to the international economy, this dataset is valuable for stock investors and analysts. The dataset has been prepared for educational and scientific purposes, motivated by the scarcity of data describing the impact of Saudi stock news on stock activity. It should therefore be useful across many fields, including stock market analytics, data mining, statistics, machine learning, and deep learning. The data are evaluated in two ways: by testing the distribution of the data over classes and by measuring sentiment prediction accuracy. The results show that the distribution of polarity over sectors is balanced. A Naive Bayes (NB) model was developed to evaluate data quality through sentiment classification; it achieved 68% accuracy, which is taken here as evidence of the data's reliability. The evaluation results therefore support the dataset's reliability, readiness, and quality for further use.
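The two evaluation steps named in the abstract, checking the class distribution and measuring sentiment prediction accuracy with a Naive Bayes model, can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the NB variant or the preprocessing, so the sketch assumes a multinomial model with add-one smoothing and whitespace tokenization. The function names and the English toy headlines are hypothetical placeholders, not records or code from the dataset.

```python
import math
from collections import Counter, defaultdict


def class_distribution(labels):
    """Return the share of each class label (e.g. polarity classes)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def train_nb(texts, labels):
    """Collect the counts needed by a multinomial Naive Bayes model.

    Assumption: whitespace tokenization; the paper's preprocessing is unknown.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior for one text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are skipped
                # add-one (Laplace) smoothing, an assumption of this sketch
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the manual label."""
    correct = sum(predict_nb(model, t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Hypothetical toy data standing in for manually labeled news items.
train_texts = [
    "company reports profit increase",
    "board approves dividend distribution",
    "company reports net loss",
    "profit decline after fire at plant",
]
train_labels = ["positive", "positive", "negative", "negative"]
test_texts = ["dividend increase approved", "net loss widens"]
test_labels = ["positive", "negative"]

model = train_nb(train_texts, train_labels)
print(class_distribution(train_labels))
print(accuracy(model, test_texts, test_labels))
```

On real data the same two measurements would be computed per sector and on a held-out split. The toy set is far too small for its accuracy to say anything about the 68% figure reported above.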
