1484
Topic Modeling to Aid Development of a CDC Mosquitoes Website
Background:
CDC is developing a comprehensive website about mosquitoes. The site centralizes content that, until now, was dispersed and difficult to find; that content drew little traffic from search engines and ranked poorly in search engine results for most queries. An important part of formative research and site design was understanding the concerns of at-risk audiences and how those concerns are expressed. Usual profiling methods involve surveys, focus groups, and marketing research databases. We decided that collecting Google search queries and analyzing them for topical patterns and word content would offer a rich perspective. In data mining and analytics, this is known as topic modeling. Search queries, though short, are entered by searchers of their own volition, in their own time, and in their own words.
Program background:
CDC is an international leader in researching, preventing, and controlling pathogens spread by vectors, including mosquitoes. Mosquito-borne disease epidemics are on the rise in the United States and US territories. People seeking information from public health experts visit CDC’s website to learn about mosquito-borne diseases, mosquito biology, bite prevention, and mosquito control.
Evaluation Methods and Results:
The evaluation analyzed a large dataset of search queries from the Google Keyword Planner tool, which is embedded in Google’s AdWords advertising application, and from the Google Trends tool. The 885-query dataset was gathered by entering successive sets of seed terms into the tool to uncover the broadest possible set of search queries related to mosquitoes. The dataset ranged from queries used thousands of times to those used infrequently. The dataset was evaluated for recurring thematic patterns and manually sorted into topic categories, each representing a broad topic of searcher concern. The categories were allowed to emerge from the data rather than being defined in advance, and the broad categories were then divided into subcategories. The resulting 11 topic sets were evaluated for recurrence of individual words, searcher concern(s), most-used search phrases, and common turns of phrase. The final report suggested ways to organize the site and to label and present CDC information and messages. Report data completed our understanding of audiences’ specific concerns and needs. We used it to guide site architecture and navigation setup. Site topic sections, such as “Mosquito Control at Home” and “Mosquito Bites,” were mapped to report data, and labeling used terms common in the data. User concerns captured in the data informed the creation of specific content. A cross-CDC workgroup of stakeholders reviewed our draft site architecture and report, and refined the results into a site outline.
Conclusions:
We found that analyzing search queries provided rich evidence that helped us make data-driven decisions in several parts of our formative research and site design process.
Implications for research and/or practice:
Topic modeling fits nicely into health communicators’ formative research evaluation toolbox. The method is straightforward, plumbs Google’s vast data pool, can be performed manually, and produces concrete results useful for multiple purposes. We look forward to efforts to automate parts of the modeling process, which would speed the work and offer additional tools and measures such as topic clustering.
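As a rough illustration of how one manual step described under Evaluation Methods and Results, counting the recurrence of individual words within each topic set, might be automated, the following Python sketch tallies word frequencies per topic. It is only a sketch under stated assumptions: the sample queries, the stop-word list, and the function names `word_recurrence` and `top_words_by_topic` are hypothetical and are not taken from the evaluation or its dataset.

```python
from collections import Counter

# Hypothetical example queries, already sorted by hand into topic categories.
# These are illustrative placeholders, NOT data from the evaluation.
TOPIC_QUERIES = {
    "Mosquito Bites": [
        "how to treat mosquito bites",
        "mosquito bites itch relief",
        "why do mosquito bites itch",
    ],
    "Mosquito Control at Home": [
        "how to get rid of mosquitoes in yard",
        "mosquito control at home",
    ],
}

# Small, assumed stop-word list; a real analysis would likely use a fuller one.
STOP_WORDS = {"how", "to", "do", "why", "of", "in", "at", "the", "a"}


def word_recurrence(queries, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across a list of queries."""
    counts = Counter()
    for query in queries:
        for word in query.lower().split():
            if word not in stop_words:
                counts[word] += 1
    return counts


def top_words_by_topic(topic_queries, n=3):
    """Return the n most frequent words for each topic category."""
    return {
        topic: word_recurrence(queries).most_common(n)
        for topic, queries in topic_queries.items()
    }


if __name__ == "__main__":
    for topic, words in top_words_by_topic(TOPIC_QUERIES).items():
        print(topic, words)
```

Word counts like these could suggest labeling terms for site sections, although sorting queries into topic categories would still depend on human judgment or a separate clustering step.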