
74 4. CHALLENGING PROBLEMS OF FAKE NEWS DETECTION
ing; next, the user comment encoder component illustrates the extraction of latent comment
features through word-level attention networks; then, the sentence-comment co-attention
component models the mutual influences between the news sentences and user comments for learning
feature representations, and the explainability degree of sentences and comments is learned
through the attention weights within co-attention learning; finally, the fake news prediction
component shows the process of concatenating news content and user comment features for
fake news classification.
News Contents Encoding. As fake news pieces are intentionally created to spread inaccurate
information rather than to report objective claims, they often have opinionated and sensational
language styles, which have the potential to help detect fake news. In addition, a news document
contains linguistic cues at different levels, such as the word level and the sentence level, which
carry different degrees of importance for explaining why the news is fake. For example,
in a fake news claim “Pence: Michelle Obama is the most vulgar first lady we've
ever had,” the word “vulgar” contributes more signal for deciding whether the news claim is
fake than the other words in the sentence.
Recently, researchers have found that hierarchical attention neural networks [177] are practical
and useful for learning document representations [24] while highlighting important words or
sentences for classification. Such a network models word-level and sentence-level
representations through self-attention mechanisms. Inspired by [24], we learn
the news content representations through a hierarchical structure. Specifically, we first learn the
sentence vectors by using the word encoder with attention and then learn the sentence
representations through the sentence encoder component.
Word Encoder. We learn the sentence representation via an RNN-based word encoder.
Although in theory an RNN is able to capture long-term dependencies, in practice the old
memory fades away as the sequence becomes longer. To make it easier for RNNs to capture
long-term dependencies, GRUs [27] are designed to have more persistent memory.
Similar to [177], we adopt a GRU to encode the word sequence.
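To make the notion of more persistent memory concrete, one common formulation of the GRU update is sketched here. The source does not spell these equations out, so the symbols (the input $x_t$, the update gate $z_t$, the reset gate $r_t$, the candidate state $\tilde{h}_t$, the weight matrices $W$, $U$, and the biases $b$) are assumptions for illustration, and some references swap the roles of $z_t$ and $1 - z_t$.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

Under this formulation, when the update gate $z_t$ is close to 0 the new state satisfies $h_t \approx h_{t-1}$, so earlier information can be carried forward largely unchanged across many steps.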
To further capture the contextual information of annotations, we use bidirectional
GRU [8] to model word sequences from both directions of words. The bidirectional GRU
contains the forward GRU $\overrightarrow{f}$ which reads sentence $s_i$ from word $w_{i1}$ to
$w_{iM_i}$ and a backward GRU $\overleftarrow{f}$ which reads sentence $s_i$ from word
$w_{iM_i}$ to $w_{i1}$:

$$
\begin{aligned}
\overrightarrow{h}_{it} &= \overrightarrow{\mathrm{GRU}}(w_{it}), \quad t \in \{1, \ldots, M_i\} \\
\overleftarrow{h}_{it} &= \overleftarrow{\mathrm{GRU}}(w_{it}), \quad t \in \{M_i, \ldots, 1\}.
\end{aligned}
\tag{4.28}
$$

We then obtain an annotation of word $w_{it}$ by concatenating the forward hidden state
$\overrightarrow{h}_{it}$ and backward hidden state $\overleftarrow{h}_{it}$, i.e.,
$h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$, which contains the information of the
whole sentence centered around $w_{it}$.
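The bidirectional encoding in Eq. (4.28) and the concatenation of the two hidden states can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the names `GRUCell`, `run_gru`, and `encode_sentence` are hypothetical, the weights are random and untrained, bias terms are omitted, and the gate equations follow one common GRU formulation that the text does not specify.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class GRUCell:
    """Toy GRU cell (hypothetical helper): random untrained weights, no biases."""

    def __init__(self, input_dim, hidden_dim, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One (W, U) pair each for the update gate z, reset gate r, candidate c.
        self.W = {k: mat(hidden_dim, input_dim) for k in "zrc"}
        self.U = {k: mat(hidden_dim, hidden_dim) for k in "zrc"}
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(self.W["c"], x), matvec(self.U["c"], rh))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


def run_gru(cell, inputs):
    """Feed inputs to the cell in the given order; return every hidden state."""
    h = [0.0] * cell.hidden_dim
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def encode_sentence(word_vectors, fwd_cell, bwd_cell):
    """Return h_it = [forward h_it, backward h_it] for each word t of one sentence."""
    forward = run_gru(fwd_cell, word_vectors)                # reads w_i1 .. w_iM
    backward = run_gru(bwd_cell, word_vectors[::-1])[::-1]   # reads w_iM .. w_i1, re-aligned
    return [f + b for f, b in zip(forward, backward)]        # list concatenation


rng = random.Random(0)
input_dim, hidden_dim, num_words = 4, 3, 5
# Random stand-ins for the word embeddings w_it of a single sentence s_i.
sentence = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(num_words)]
fwd_cell = GRUCell(input_dim, hidden_dim, rng)
bwd_cell = GRUCell(input_dim, hidden_dim, rng)
annotations = encode_sentence(sentence, fwd_cell, bwd_cell)
print(len(annotations), len(annotations[0]))  # prints: 5 6
```

Each annotation has twice the hidden size because it stacks the forward state, which summarizes the words up to position t, with the backward state, which summarizes the words from position t to the end. In practice a deep learning library's bidirectional recurrent layer would replace these hand-written loops.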