Text Mining

Text Mining in Marine Science

The number of scientific publications is growing rapidly and is estimated to double every nine years.

This makes it hard for scientists, researchers and policy makers to keep up with the literature, leaving little time to read outside their own field.

Text mining aims to process large amounts of text to reveal hidden connections and patterns. Because understanding human language is so hard, text mining efforts have focused on extracting predetermined types of information and ignoring everything else in the text.

For instance, we can focus on identifying all mentions of marine species in text. From the observation that two species are frequently mentioned together in the same sentence, we may infer that they are probably associated. In addition, targeting a specific relation between species, such as “A eats B”, would enable us to automatically construct food chains and food webs from text.
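The sentence-level co-occurrence idea can be illustrated with a minimal Python sketch. It is not the project's actual method: the species lexicon, the naive sentence splitter and the function names are all assumptions made for this example, and a real system would use a curated taxonomy and a proper named-entity recognizer.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical species lexicon, used for illustration only.
SPECIES = ["cod", "capelin", "krill", "herring"]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrence_counts(text, lexicon=SPECIES):
    """Count how often each pair of lexicon terms shares a sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Whole-word matches only; plurals and synonyms are not handled.
        found = sorted(
            name for name in set(lexicon)
            if re.search(rf"\b{re.escape(name)}\b", lowered)
        )
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Cod feed on capelin. Capelin eat krill. "
        "Cod and capelin share habitat. Herring are common."
    )
    for pair, n in cooccurrence_counts(sample).most_common():
        print(pair, n)
```

Pairs with high counts are only candidates for an association; the counts say nothing about the kind of relation involved.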

Knowledge extracted through text mining can then be combined with existing knowledge from databases and with automatic reasoning in order to discover new knowledge. For example, a system may suggest a new hypothesis that connects two or more phenomena, extracted from different articles, through a chain of causal relations. This process is also known as Literature Based Knowledge Discovery (LBKD).
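As a rough sketch of how such a causal chain might be assembled, the Python example below stores extracted relations as directed edges and searches for a path between two phenomena with a breadth-first search. The relation list, article identifiers and function names are hypothetical; only the "adding iron" relation echoes an example from this page, and the second relation is a placeholder.

```python
from collections import defaultdict, deque

# Hypothetical extracted relations: (cause, effect, source article).
RELATIONS = [
    ("adding iron", "phytoplankton growth", "article-1"),
    ("phytoplankton growth", "hypothetical effect X", "article-2"),
]


def build_graph(relations):
    """Map each cause to the list of (effect, source) pairs it leads to."""
    graph = defaultdict(list)
    for cause, effect, source in relations:
        graph[cause].append((effect, source))
    return graph


def find_causal_chain(relations, start, goal):
    """Return the shortest chain of relations from start to goal, or None."""
    graph = build_graph(relations)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for effect, source in graph.get(node, []):
            if effect not in visited:
                visited.add(effect)
                queue.append((effect, path + [(node, effect, source)]))
    return None


if __name__ == "__main__":
    chain = find_causal_chain(RELATIONS, "adding iron", "hypothetical effect X")
    for cause, effect, source in chain:
        print(f"{cause} -> {effect} ({source})")
```

A chain whose links come from different articles is a candidate hypothesis only. Each link keeps its source so that a human reviewer can trace it back to the original text.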

Of course, we still need humans to check and verify such hypotheses.

Text Mining in Ocean-Certain includes:

  1. Information Retrieval – creating a large collection of preprocessed and indexed articles in Marine Science and related fields.
  2. Information Extraction – automatically extracting change events, for example “pH level of surface water in the Arctic Ocean is rising” or “primary production has decreased”. Causal relations and correlations between these change events are also detected, for example “adding iron causes phytoplankton growth”.
  3. Knowledge Discovery – combining extracted information from separate articles with background knowledge and reasoning algorithms to produce new hypotheses in the form of causal chains or feedback loops. Background knowledge comes from ontologies and linked data, as well as modelling of domain expert knowledge.
  4. User Interface – presenting this knowledge to end users as an interactive web application, which includes browsable network visualizations.
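To give a feel for the Information Extraction step, here is a deliberately simple Python sketch that recognizes change events of the form "variable is/has trigger" using a regular expression. The trigger word lists, the pattern and the function name are assumptions for illustration; the actual system is not described at this level of detail here and presumably relies on richer linguistic analysis.

```python
import re

# Hypothetical trigger vocabularies, kept small for illustration.
INCREASE = {"rising", "increasing", "increased", "risen"}
DECREASE = {"falling", "decreasing", "decreased", "declined"}

# "<variable> is|are|has|have <trigger>" with optional trailing punctuation.
PATTERN = re.compile(
    r"^(?P<variable>.+?)\s+(?:is|are|has|have)\s+(?P<trigger>\w+)\W*$",
    re.IGNORECASE,
)


def extract_change_event(sentence):
    """Return {'variable', 'direction'} for a simple change sentence, else None."""
    match = PATTERN.match(sentence.strip())
    if not match:
        return None
    trigger = match.group("trigger").lower()
    if trigger in INCREASE:
        direction = "increase"
    elif trigger in DECREASE:
        direction = "decrease"
    else:
        return None
    return {"variable": match.group("variable"), "direction": direction}


if __name__ == "__main__":
    for text in [
        "pH level of surface water in the Arctic Ocean is rising",
        "primary production has decreased",
    ]:
        print(extract_change_event(text))
```

A pattern like this misses most real sentences, such as those using the passive voice, negation or nested clauses, which is why it should be read as a toy baseline only.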

For more information and to check out a demo, please see http://baleen.idi.ntnu.no/