Corpora

A corpus (plural: corpora), as the research community uses the term, is a large body of structured text. Big data is useless unless we can apply real use cases to it. One of the best ways to extract useful information from a body of text is corpora comparison: a body of text is checked against a baseline, often on word frequencies. The baseline can be a contrasting category or a generic English dictionary.
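A minimal sketch of the idea, assuming a simple smoothed ratio of relative word frequencies as the comparison measure (this is an illustration, not necessarily the scoring used by the tools later in this post). The function names and the toy texts are invented for the example.

```python
from collections import Counter
import re


def term_counts(texts):
    """Count lower-cased word tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def frequency_ratio(target, baseline, smoothing=1.0):
    """Ratio of smoothed relative frequencies (target vs. baseline) per term."""
    vocab = set(target) | set(baseline)
    target_total = sum(target.values()) + smoothing * len(vocab)
    baseline_total = sum(baseline.values()) + smoothing * len(vocab)
    return {
        term: ((target[term] + smoothing) / target_total)
        / ((baseline[term] + smoothing) / baseline_total)
        for term in vocab
    }


# Toy data, invented for illustration only.
target_texts = ["chemo starts monday", "chemo and more chemo"]
baseline_texts = ["monday again", "more coffee on monday"]

ratios = frequency_ratio(term_counts(target_texts), term_counts(baseline_texts))
top_terms = sorted(ratios, key=ratios.get, reverse=True)[:3]
print(top_terms)
```

A ratio above 1 means the term is relatively more frequent in the target text than in the baseline; a ratio below 1 means the opposite.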

Detecting social trends in global public sentiments between Cancer and Depression

Cancer and depression are both matters close to the heart for many people.

Deriving structured insights from public sentiment should be useful to public health practitioners who want to follow trends to better serve public needs.

For this work, tweets with either ‘Cancer’ or ‘Depression’ as queries were obtained from Twitter in 2018. Tweets which had missing or nonsensical data were filtered out, leaving us with a total of 165,900 tweets to work with. We want to know which words are more characteristic of one topic than of the other.

Packages used are:

  • Scattertext: corpora visualizer tool by Jason Kessler. This is a visualizer for the characteristic terms of corpora.
  • spaCy: Natural Language Processing with Python and Cython. It provides an integrated syntactic parser backed by neural network models.
  • Empath: for analyzing text across lexical categories. Useful for assigning category labels to baskets of words.

We aggregated the tweets into a pandas DataFrame and labelled each one as representative of either Cancer or Depression. We removed terms that occurred fewer than 30 times, to filter out noise and stop words. We used the spaCy parser to break the tweets down into their component words.

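import spacy
import scattertext as st

# Turn the tweets into a Scattertext Corpus.
# df is the pandas DataFrame of tweets with 'category' and 'text' columns.
# 'en' is the spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
corpus = st.CorpusFromPandas(df,
                             category_col='category',
                             text_col='text',
                             nlp=nlp).build()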
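The 30-occurrence cut-off mentioned earlier can be sketched with the standard library alone. This is only an illustration of the filtering idea, assuming tweets are already split into tokens; it is not the call the original pipeline used, and the helper names are hypothetical.

```python
from collections import Counter


def frequent_terms(tokenised_tweets, min_count=30):
    """Return the terms that occur at least min_count times overall."""
    counts = Counter(token for tweet in tokenised_tweets for token in tweet)
    return {term for term, n in counts.items() if n >= min_count}


def drop_rare_terms(tokenised_tweets, min_count=30):
    """Remove tokens whose corpus-wide count is below min_count."""
    keep = frequent_terms(tokenised_tweets, min_count)
    return [[tok for tok in tweet if tok in keep] for tweet in tokenised_tweets]


# Toy example (invented) with a low threshold so the effect is visible.
tweets = [["chemo", "today"], ["chemo", "tired"], ["chemo", "today"]]
filtered = drop_rare_terms(tweets, min_count=2)
print(filtered)
```

With the real data the threshold would stay at 30; the toy threshold of 2 is only there to keep the example small.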

The difference in characteristics of terms between Cancer and Depression from global tweets in 2018

HTML file here (37 MB). Open it in Google Chrome or Firefox and give it a few minutes to load; it is a large, complex file to render. You can mouse over the points in the scatterplot to retrieve more information on terms and themes.

This is the scatterplot visualisation of terms associated with Cancer and Depression. The top-right quadrant holds terms that occur frequently in both Cancer and Depression, mainly anxiety and battling. The top left holds terms that are frequent in Cancer but not in Depression, like chemo and sunscreen. The bottom right holds terms that are frequent in Depression but not in Cancer, like seasonal depression. The bottom left holds the sparse, less relevant terms.
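The quadrant reading can be made concrete with a small sketch. It assumes each term is placed by the rank of its frequency in each category (vertical axis for Cancer, horizontal axis for Depression) and that the midpoint splits the quadrants; Scattertext's actual axis scaling may differ, and the counts below are invented.

```python
def percentile_ranks(counts):
    """Map each term to a rank-based position in [0, 1] by frequency."""
    ordered = sorted(set(counts.values()))
    if len(ordered) == 1:
        return {term: 0.5 for term in counts}
    position = {value: i / (len(ordered) - 1) for i, value in enumerate(ordered)}
    return {term: position[n] for term, n in counts.items()}


def quadrants(cancer_counts, depression_counts, cutoff=0.5):
    """Label every term with the quadrant it would fall into."""
    terms = set(cancer_counts) | set(depression_counts)
    y = percentile_ranks({t: cancer_counts.get(t, 0) for t in terms})
    x = percentile_ranks({t: depression_counts.get(t, 0) for t in terms})
    labels = {}
    for t in terms:
        vertical = "top" if y[t] >= cutoff else "bottom"
        horizontal = "right" if x[t] >= cutoff else "left"
        labels[t] = f"{vertical} {horizontal}"
    return labels


# Invented counts, for illustration only.
cancer = {"anxiety": 900, "chemo": 700, "seasonal": 5, "lol": 10}
depression = {"anxiety": 950, "chemo": 8, "seasonal": 600, "lol": 12}

labels = quadrants(cancer, depression)
for term in sorted(labels):
    print(term, "->", labels[term])
```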

Terms that are most characteristic of Cancer are pancreatic, childhood cancer and hpv. Terms that are most characteristic of Depression are In Music we trust, mind charity and seasonal depression.

For a practitioner, the points most worth a deeper dive at first glance are:

  1. Charities tagged #inmusicwetrust, #mindcharity, #donating 50 are trending right now. What are these charities doing right to trend, and should we learn from them in our own fundraising efforts?
  2. Seasonal depression is recurring. Why is there a surge of interest in it? Have people recently been feeling sad and doubtful? Should we quickly launch public education efforts before more individuals harm themselves?
  3. Pancreatic cancer and childhood cancer are trending. Why are more people taking notice of them? Is there greater awareness, or is it a sign of the public calling for help?

Term comparison between categories

For each term of interest, for example anxiety, we can see the tweets in which it appears. Anxiety overlaps heavily between Cancer and Depression and is a common battle faced by both groups.

Next, we group the terms into categories rather than looking only at unigrams (one word) and bigrams (two words) by themselves. We create a corpus of extracted categories with the Empath package.
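To illustrate what grouping terms into categories means, here is a minimal sketch that uses a tiny hand-made lexicon as a stand-in for Empath. The lexicon, the function names and the documents are all invented for the example; Empath ships its own, much larger category lists.

```python
from collections import Counter

# Hypothetical mini-lexicon, invented for illustration only.
LEXICON = {
    "family": {"mother", "father", "kids", "family"},
    "money": {"bills", "cost", "insurance", "money"},
    "sadness": {"sad", "crying", "hopeless", "lonely"},
}


def category_counts(tokens, lexicon=LEXICON):
    """Count how many tokens of one document fall into each category."""
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return counts


def corpus_category_counts(documents, lexicon=LEXICON):
    """Sum the per-document category counts over a whole corpus."""
    total = Counter()
    for tokens in documents:
        total.update(category_counts(tokens, lexicon))
    return total


# Toy documents, already tokenised.
cancer_docs = [["chemo", "bills", "family"], ["insurance", "cost", "kids"]]
depression_docs = [["lonely", "sad", "again"], ["crying", "family"]]

cancer_cats = corpus_category_counts(cancer_docs)
depression_cats = corpus_category_counts(depression_docs)
print(cancer_cats.most_common())
print(depression_cats.most_common())
```

Once each document is reduced to category counts, the two corpora can be compared on categories in the same way they were compared on individual terms.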

The difference in extracted categories between Cancer and Depression from global tweets in 2018

HTML file here (59 MB). Similarly, give the page a few minutes to load and render.

From the scatterplot above, we are able to derive some hidden insights. Cancer is characterised more by family, school, work and money, which matches experience: cancer and its treatment place a big burden on a patient’s lifestyle and family, and it is an expensive illness. A care provider should take mitigating steps to address these salient pain points. For depression, the impact is mainly on the multitude of emotions and suffering felt. It seems to be a more dormant, haunting illness, unlike cancer, which can be readily observed. This puts depression patients even more at risk, because the illness is hard to identify visually. These findings can help shape and support the direction of depression education.

Using web intelligence to solve real-world problems

Collecting and visualising the data is a small part of the work. How to use and apply the data is what matters most. The value lies in being able to structure effective campaigns to address the case at hand. From data, we hypothesise and then validate the findings with more data.

Corpora comparison can be done in a few main dimensions to extract information:

  • Basket of terms A vs basket of terms B. See what sets them apart.
  • Basket of terms A vs a general baseline dictionary. See which terms are distinctive of category A.
  • Categories A vs categories B. Compare the bigger, categorised picture.
  • Basket of terms / categories over time. See how trends have shifted.

To seek out more useful information, more nuanced iterations of this work can be done:

  • Time slice and bucket the terms over time. See how public opinions have trended.
  • Meta targeting. Filter down to geographical origin, age and gender to perform cross-sectional analysis of public opinions. Analyse and address these sub-segments accordingly.
  • Mine targeted media. Go to a localized health advice forum or the Facebook page of a localized health magazine and see what the sub-segments are talking about.
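As a rough sketch of the time-slicing idea in the list above, tweets can be bucketed by calendar month and a term's count followed across the buckets. The bucket size, the function names and the sample tweets are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_term_counts(tweets):
    """Bucket (date, tokens) pairs by calendar month and count terms per bucket."""
    buckets = defaultdict(Counter)
    for posted_on, tokens in tweets:
        buckets[(posted_on.year, posted_on.month)].update(tokens)
    return dict(buckets)


def term_trend(buckets, term):
    """Return [(month, count)] for one term in chronological order."""
    return [(month, buckets[month][term]) for month in sorted(buckets)]


# Toy tweets (invented), already tokenised.
tweets = [
    (date(2018, 1, 5), ["seasonal", "depression"]),
    (date(2018, 1, 20), ["seasonal", "tired"]),
    (date(2018, 6, 2), ["depression", "summer"]),
]

buckets = monthly_term_counts(tweets)
trend = term_trend(buckets, "seasonal")
print(trend)
```

The same bucketing works for the meta-targeting idea: replace the month key with a region, age band or gender field where such metadata is available.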