0、Abstract

In this paper, we extract qualitative information from crude oil news headlines and develop a novel VMD-BiLSTM model with an investor sentiment indicator for crude oil forecasting.

In short: qualitative information is extracted from crude oil news headlines, and a novel VMD-BiLSTM model is developed to forecast crude oil prices.

First, we construct a sentiment score considering cumulative effect from contextual data of oil news texts.

First, we construct a sentiment score that accounts for the cumulative effect of news.

Then, we adopt an event-based method and GARCH model to investigate the impact of news sentiment on returns and volatility. A non-recursive signal decomposition method, namely variational mode decomposition (VMD), is applied to decompose the historical crude oil return and volatility data into various intrinsic modes.

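Second, we use an event-based method and a GARCH model to investigate the impact of the sentiment indicator on returns and volatility. A non-recursive signal decomposition method (VMD) is applied to decompose historical crude oil returns and volatility.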

After that, a bidirectional long short-term memory (BiLSTM) neural network is introduced as the deep learning prediction model that integrates both the qualitative and quantitative model inputs.

Next, a bidirectional long short-term memory (BiLSTM) deep learning prediction model is applied, combining qualitative and quantitative inputs.

Our empirical results indicate that the shock of news sentiment significantly causes the fluctuation of oil futures prices, and news sentiment has an asymmetric impact on the volatility of oil futures. The incorporation of sentiment score is always helpful for improving the forecasting performances in all benchmark scenarios. Specifically, our proposed data-decomposition based deep learning model is more effective than several econometric and machine learning models.

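Our empirical results show that news sentiment shocks significantly drive fluctuations in crude oil prices, and that news sentiment has an asymmetric impact on the volatility of oil futures; including the sentiment score helps improve forecasting performance.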

1、Introduction

It is universally acknowledged that the crude oil futures market is a typical risk aggregation market that attracts worldwide attention: oil futures price movements are identified to be more likely exposed to global political events and receive simultaneous shocks from other financial asset markets (Leduc and Sill, 2004; Considine and Larson, 2001).

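It is well known that the crude oil futures market is a typical risk aggregation market that attracts worldwide attention: oil futures prices are exposed to global political events and to simultaneous shocks from other financial markets.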

On the other hand, the prices of global financial assets also receive positive feedback from oil futures price movements and disturbances (Hamilton and Wu, 2014; Teterin et al., 2016).

Conversely, global financial asset prices also receive positive feedback from oil futures prices.

For crude oil safety management and asset allocation, precise prediction of oil futures market returns and risks can provide useful guidance for policy makers and investors.

Therefore, precise prediction of oil futures market returns and risks is useful.

However, the nonlinearity of oil futures prices is shaped by different types of factors, such as the supply and demand relationship (Kilian, 2009), international events (Zhao et al., 2016) and investor sentiment (Qadan and Nama, 2018). This complex price structure makes the prediction of oil returns and volatilities a tough task.

However, these complex mechanisms make oil returns and volatilities hard to predict.

A large number of studies are devoted to predicting oil returns and volatilities utilizing the historical time-series data of the oil market and economically related influencing factors. For example, Fan et al. (2008) use historical observations of WTI and Brent crude oil time-series data to predict future oil prices based on a genetic algorithm. Shin et al. (2013) introduce a semi-supervised learning approach to investigate the factors that affect oil price movements, including OPEC and Saudi oil production, USD exchange rates and the Producer Price Index.

A large number of studies are devoted to predicting oil returns and volatilities.

However, some underlying factors, such as investor sentiment, may act as potential causes of oil price changes and fluctuations (Du et al., 2016; Qadan and Nama, 2018). These factors are hard to assess and calculate in empirical work because market sentiment and reaction are not directly quantifiable.

However, some underlying factors, such as investor sentiment, are hard to quantify.

Scholars have attempted to find appropriate proxies for investor concerns and sentiments in financial markets. For example, Baker and Wurgler (2007) utilize a combination of financial indices to quantify investor sentiment in the stock market, including stock trading volume, mutual fund flows and IPO volume. Smales (2017) introduces the CBOE Volatility Index (VIX) as a measure of investor fear and investigates the relationship between VIX and stock returns. Kostopoulos et al. (2020) apply Google search volumes as a proxy for the trading intensity of individual investors in Germany.

Scholars have tried to construct indicators of investor concerns and sentiment.

However, the previous measurements of investor sentiment are less effective in providing untapped information for asset return and volatility prediction due to the following weaknesses (Li et al., 2019).

However, the previous measurements of investor sentiment are less effective in providing untapped information.

First, official indices and statistics, such as transaction volume, are identified to provide less unexplored information about investor attention, which is mainly due to their lower consistency with individual traders (Deng et al., 2012).

First, official indices and statistics, such as trading volume, provide little unexplored (untapped) information.

Second, the intensity and volume data of search engines contain too much investor-irrelevant noise (Limnios and You, 2018). As a result, the sentiment indicators calculated from search indices may show less effectiveness and a lower confidence level in financial asset prediction.

Second, search engine data contains too much irrelevant noise.

Natural Language Processing (NLP) techniques and large available datasets provide a novel framework for constructing investor sentiment indicators. By crawling the news headlines from hubs and websites for energy news, a news dataset of crude oil can be tokenized. Utilizing the headline documents, daily investor sentiment scores are generated based on vector space models (Salton et al., 1975). Finally, returns and volatilities of crude oil are predicted by incorporating the daily polarity score of market sentiment.

NLP techniques and large datasets provide a novel framework for constructing investor sentiment indicators.

A sentiment index based on news headlines has the following advantages: First, news headlines reflect key information about investor attention, which can be measured and obtained efficiently through NLP techniques.

Advantage: news headlines reflect key information about investor attention.

Second, sentiment index calculated by news headline contains less noise and irrelevant information, which is helpful to improve the reliability of indicator construction (Nassirtoussi et al., 2015).

Less irrelevant noise.

In this paper, we formally investigate the impact of news sentiment on oil futures returns and volatility by an event-based method and GARCH model estimations. Overall, the daily investor sentiment of crude oil is computed by NLP techniques in this paper and acts as a novel predictor for crude oil futures returns and volatilities.

Several types of forecasting methods have been applied to oil future returns and volatility prediction by previous works, such as econometric models (Klein and Walther, 2016) and machine learning approaches (Yu et al., 2008; Tang et al., 2015; Yu et al., 2017).

Various methods have previously been applied to oil futures returns and volatility prediction.

However, the econometric or machine learning typed predictors achieve inferior forecasting performance in comparison with the newly introduced deep learning approach (Mallqui and Fernandes, 2018).

However, they all perform worse than the deep learning approach.

Utilizing artificial neural networks consisting of multiple hidden layers, deep learning models show superior time-series data predictability over their counterparts (LeCun et al., 2015). In recent years, deep learning has been applied broadly in crude oil time-series data prediction. For example, Zhao et al. (2017) apply a novel stacked denoising autoencoders (SDAE) model for crude oil forecasting based on a large dataset of exogenous influencing parameters. Luo et al. (2019) employ a novel convolutional neural networks (CNN) model to improve the short-term prediction performance for the crude oil market.

Some deep learning literature is listed here.

Since crude oil market returns and volatilities are non-stationary time-series data and consistent with complex influencing factors, the prediction accuracies of the proposed models may suffer due to the high volatilities. In recent studies, a novel ensemble forecasting method, namely “Decomposition and Ensemble”, has been developed to handle the task of irregular and non-stationary time-series data prediction (Bergmeir et al., 2016; Risse, 2019). This method decomposes the original time-series data into several stationary cycles, which can be estimated by forecasting models individually and finally integrated to generate the forecasting output. Among all the decomposition approaches, the empirical mode decomposition (EMD) typed method is the predominant approach utilized in current empirical works (Wen et al., 2017; Santhosh et al., 2019).

EMD: addresses the poor prediction accuracy caused by the high volatility of the return and volatility time series.

However, the prediction errors may accumulate when the forecasts of the individual decomposed series are combined, which is considered to reduce the prediction accuracy (Tang et al., 2015). In addition, EMD typed models may also give rise to the mode-mixing problem, which may produce oscillations with similar scales in single decomposed factors (Colominas et al., 2014).

EMD typed models may cause the mode-mixing problem.

Based on the above studies, this paper develops a novel VMD-BiLSTM model with investor sentiment indicator for crude oil forecasting.

First, we extract the qualitative information from crude oil news headlines and conduct sentiment analysis on the contextual data, which provides effective and unexplored information for deep learning forecasting.

First, qualitative information is extracted from news headlines, which can uncover untapped information.

Moreover, we adopt an event-based method and GARCH model to investigate the impact of news sentiment on returns and volatility.

In addition, an event-based method and a GARCH model are adopted to investigate the impact of news sentiment on returns and volatilities.

Second, a non-recursive signal decomposition method, namely variational mode decomposition (VMD), is applied to decompose the historical crude oil return and volatility data into various intrinsic modes. Compared to the predominant decomposition approach EMD, VMD has been shown to avoid the mode-mixing problem effectively (Dragomiretskiy and Zosso, 2014).

Second, VMD is applied to decompose historical crude oil returns and volatility into various intrinsic modes, which can avoid the mode-mixing problem.

Third, a bidirectional long short-term memory (BiLSTM) neural network is introduced as the deep learning prediction model that integrates both the qualitative and quantitative model inputs. The proposed BiLSTM model can extract a two-way sequential relationship in the time series data.

Third, BiLSTM can integrate both the qualitative and quantitative model inputs. The proposed BiLSTM model can extract a two-way sequential relationship.

According to our empirical results, we find the shock of news sentiment significantly causes the fluctuation of oil futures prices. Specifically, oil futures prices react positively around positive news shocks, and present relatively weak decline surrounding negative news shocks. According to the estimations of GARCH models, we find that news sentiment has an asymmetric impact on the volatility of oil futures. As for oil return and volatility forecasting, the incorporation of news index is always helpful for improving the forecasting performances in all benchmark scenarios. Specifically, our proposed data-decomposition based deep learning model is more effective than several econometric and machine learning models.

The major contributions of this paper may lie in that, to the best of our knowledge, this is the first paper to incorporate the sentiment index of oil market based on NLP technique for oil future returns and volatilities prediction, which serves as an initial attempt to improve the forecasting results utilizing the hidden and effective information of irrational behaviors in the crude oil market.

Furthermore, we empirically confirm the effectiveness of our proposed hybrid deep learning models for oil return and volatility forecasting. Our proposed model outperforms several benchmark econometrics, machine learning models, deep learning models and hybrid learning models. The methodology and empirical results presented by our study shed new light on risk controls of oil-related assets based on large-scale online datasets and data-driven approaches.

The rest of this paper is arranged as follows: Section 2 presents the research framework, news text analysis methods and forecasting models; Section 3 tests the impact of news sentiment on oil returns and volatility based on an event-based method and the estimation of GARCH models; Section 4 presents the empirical results of oil returns and volatility forecasting, including several robustness tests. Finally, the concluding remarks and future directions are presented in Section 5.

2、Methodology

2.1 Research Framework

The forecasting approach proposed in this study aims to utilize qualitative information extracted from financial news headlines and quantitative information extracted from market time series data to improve the return and volatility forecasting accuracy in the crude oil futures market.

The framework of our proposed approach is shown in Fig. 1. Specifically, there are five major steps, namely data collection, data preprocessing, sentiment analysis, data decomposition, as well as returns and volatility forecasting. These steps are explained in detail in Sections 2.2–2.5.

2.2 Data collection and preprocessing

For this study, we collected two different datasets separately: crude oil futures price data and news headlines. In terms of the crude oil price dataset, the Brent (LCO) crude oil daily futures contract closing prices are retrieved from Investing.com, for the time period from January 4, 2010 to September 17, 2019. In terms of the news headlines dataset, we collect all the available news data related to “Crude oil” from oilprice.com, which is one of the largest hubs for energy news in the world with over 100,000 daily visitors, for the same time period as the crude oil price data. Instead of using full news articles in the analysis, we use news headlines due to several advantages: first, news headlines can provide a sufficient summary of the key news information; second, news headlines contain much less repetition and fewer irrelevant words than the news article itself (Nassirtoussi et al., 2015).

We first preprocess the raw news headlines dataset by tokenizing the headlines, converting them to lowercase, and removing stop words and punctuation.

Convert to lowercase; remove stop words and punctuation.

Stop words are the most common words in a language, such as “the”, “a”, “on”, “all” and “is”. Since stop words, along with punctuation, do not carry important information related to the text, they are removed during preprocessing.
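The preprocessing described above can be sketched in a few lines of Python. This is only an illustration, not the paper's code: the `STOP_WORDS` set is a tiny hypothetical list, the `preprocess` helper and the sample headline are made up, and tokenization is a plain whitespace split.

```python
import string

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "on", "all", "is", "to", "of", "in", "as"}

def preprocess(headline):
    """Lowercase a headline, strip punctuation, and drop stop words."""
    lowered = headline.lower()
    # Delete every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization, then stop-word filtering.
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]

# Made-up headline for demonstration only.
tokens = preprocess("Oil Prices Rise as OPEC Cuts Output, Again!")
print(tokens)
```

Stripping punctuation character by character is crude (it turns "U.S." into "us", for example), so a dedicated tokenizer would be a reasonable substitute in practice.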

After removing stop words and punctuation, the “bag-of-words” approach is then employed to transform news texts into vectors. In this approach, each document (news headline) is represented by a vector, and each word within the document represents an element in the vector.

Each news headline is treated as one document and represented as a vector.

Each word in a headline corresponds to one element of the vector.

The length of each vector is determined by the number of distinct words in the corresponding news headline in the dataset.

Vector length = the number of distinct words in the corresponding news headline.

In this study, we also use a commonly used weighting technique, namely Term Frequency-Inverse Document Frequency (TF-IDF), in the vectorization process to evaluate the importance of a word to a specific document in a collection of documents. The importance of the word increases proportionally with the number of times it appears in the document, but decreases with the number of documents that contain the word in the collection. Specifically, the TF-IDF score of word x in a document is calculated as in Eq. (1) of the paper.

Compute TF*IDF: it evaluates the relative importance of a word.
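To make the weighting concrete, here is a minimal sketch using one common textbook form, tf × log(N / df). This form is an assumption and may differ from the paper's Eq. (1) in normalization or smoothing. The `tf_idf` helper and the three sample headlines are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed form: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up, already preprocessed headlines.
headlines = [
    ["oil", "prices", "rise"],
    ["oil", "prices", "fall"],
    ["opec", "cuts", "oil", "output"],
]
scores = tf_idf(headlines)
print(scores[0])
```

In this toy collection "oil" appears in every headline, so its weight is zero, while "rise" appears in only one headline and gets the highest weight in the first document. This matches the behaviour described in the notes: weight grows with in-document frequency and shrinks with the number of documents containing the word.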

In terms of the crude oil price data, we select the daily returns of the Brent crude oil futures contracts as well as the 7-day volatility as the prediction targets.

Daily log returns and seven-day average volatility: prediction targets.
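A rough sketch of how the two prediction targets could be computed from closing prices follows. The notes do not give the exact volatility definition, so the rolling sample standard deviation of log returns over seven days is an assumption here, and the helper names and prices are made up.

```python
import math
import statistics

def log_returns(prices):
    """Daily log returns: r_t = ln(P_t / P_{t-1})."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def rolling_volatility(returns, window=7):
    """Sample standard deviation of returns over a trailing window.

    Assumed definition of the 7-day volatility; the paper may use another.
    """
    return [
        statistics.stdev(returns[i - window + 1 : i + 1])
        for i in range(window - 1, len(returns))
    ]

# Made-up closing prices, for illustration only.
prices = [60.0, 61.2, 60.5, 62.0, 61.8, 63.1, 62.4, 64.0, 63.5]
rets = log_returns(prices)
vols = rolling_volatility(rets)
print(len(rets), len(vols))
```

Nine prices give eight returns, and a seven-day window over eight returns gives two volatility values, so the two target series have different starting dates and need to be aligned before modelling.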

Orthogonalization.

2.3 Sentiment analysis

In this study, we employ the Sentimentr package in R to calculate the sentiment of each processed news headline. The Sentimentr package returns the polarity score in the range of [−1.0, 1.0] for each document.

The news is considered as positive news if its polarity score is above zero, otherwise, it is considered as negative news. In general, the more negative the polarity score, the more negative the news; the more positive the polarity score, the more positive the news.

As pointed out by previous studies, news often has a rather continuous effect on investor sentiment in the actual futures market (Akhtar et al., 2013). That is to say, the public sentiment on a specific day is shaped by the combination of news on that day and in the previous few days. However, more recent news is more influential than older news. Considering this situation, we formulate a cumulative sentiment score (CSS) following Kiritchenko et al. (2014) and Chowdhury et al. (2014). In this study, we assume any piece of news will have a significant impact on investor sentiment for seven days, and that its impact declines exponentially each day after its release, which is consistent with the actual situation of news impact (Huang et al., 2014).
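The idea of a seven-day, exponentially decaying news impact can be sketched as below. The exact CSS formula is not reproduced in these notes, so the weight exp(-decay × lag) and the default `decay=1.0` are assumptions, as are the function name and the sample data.

```python
import math

def cumulative_sentiment(daily_polarity, window=7, decay=1.0):
    """Sum today's polarity and the previous days' polarity within the window,
    each weighted by exp(-decay * lag). The weighting scheme is an assumption."""
    css = []
    for t in range(len(daily_polarity)):
        total = 0.0
        # lag 0 is today; older news gets exponentially smaller weight.
        for lag in range(min(window, t + 1)):
            total += daily_polarity[t - lag] * math.exp(-decay * lag)
        css.append(total)
    return css

# One positive news day followed by seven quiet days (made-up data).
polarity = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
css = cumulative_sentiment(polarity)
print(css)
```

With a single news shock on day 0, the score decays day by day and drops to zero on day 7, when the shock leaves the seven-day window.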

2.4 Data decomposition

According to previous literature, decomposing the original time series data into sub-series modes with different economic implications can help the neural networks capture its tendency and cyclicity (Wang et al., 2014). In this study, we employ variational mode decomposition (VMD) in the data decomposition process for the daily returns and 7-day volatility time series of Brent crude oil. In general, VMD is a non-recursive optimization technique that decomposes the original input signal f(t) into a series of discrete and stationary intrinsic modes uk through Wiener filtering and Hilbert transform (Liu et al., 2016). The optimization procedure is as follows (Zhang et al., 2017):

Step 1: Calculate the Hilbert transform of each mode uk and transform into respective uni-sided frequency spectrum.

Step 2: Shift the frequency spectrum of each mode uk to a narrow frequency baseband.

Step 3: Conduct the H1 Gaussian smoothness on the demodulated signal to obtain the bandwidth of each mode uk.

The optimal solution is obtained using the alternating direction method of multipliers (ADMM) (Hestenes, 1969) and the original input signal f(t) is decomposed into K intrinsic modes.
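For reference, the three steps and the ADMM solution above correspond to the standard VMD formulation of Dragomiretskiy and Zosso (2014), written out below. Several symbols do not appear in these notes and are taken from that standard formulation: \omega_k is the center frequency of mode u_k, \delta is the Dirac distribution, * denotes convolution, \alpha is the quadratic penalty weight and \lambda is the Lagrange multiplier. The first line is the constrained problem, the second is the augmented Lagrangian that ADMM minimizes.

```latex
\min_{\{u_k\},\{\omega_k\}} \sum_{k=1}^{K}
  \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right]
  e^{-j\omega_k t} \right\|_2^2
\quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t)

\mathcal{L}\left(\{u_k\},\{\omega_k\},\lambda\right) =
  \alpha \sum_{k=1}^{K}
  \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right]
  e^{-j\omega_k t} \right\|_2^2
  + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2
  + \left\langle \lambda(t),\; f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle
```

The term in square brackets is the analytic signal from Step 1, the factor e^{-j\omega_k t} performs the baseband shift of Step 2, and the squared L2 norm of the time derivative is the bandwidth estimate of Step 3.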

2.5 Deep learning forecasting model: BiLSTM


[Literature reading] The role of news sentiment in oil futures returns and volatility forecasting