Reliable end-to-end text recognition in broadcast video data

Automated multimedia content analysis is an application field of rapidly growing importance, driven by steadily increasing volumes of digital data and the value of extracted content for a wide range of customers: telecommunication organizations, financial services, the information management industry, as well as non-profit organizations and governing bodies. Automated monitoring delivers, 24/7 and in real time, the data that human decision makers need in order to act on the resulting situational awareness. Applications include analysis of the competitive market situation, reputation management, crisis management, and many others.

Within the K-project Vision+, researchers from AIT, together with the industrial partner eMedia Monitor, a leading provider of automated media monitoring solutions and services, have laid the foundation for an automated end-to-end (image-to-text) text recognition system. The algorithms have been validated on large real-world data sets, and the recognition results suggest that all text-containing image frames can be retrieved efficiently and analyzed with high accuracy. The scientific challenges encountered concern representation and segmentation, centered on two questions: (i) which visual features can informatively describe text and its variations (size, font, spacing, color), and (ii) how to segment text reliably in the presence of clutter while maintaining high computational speed even for high-resolution images. The results achieved so far are of promising quality, suggesting that visual and audio/speech information fusion, to be investigated in a later phase of the Vision+ project, can be successfully integrated into a large-scale, real-time broadcast multimedia analysis system.
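To make the segmentation question concrete, a common low-level cue is that text lines produce many strong horizontal intensity transitions (stroke edges), while flat background regions produce few. The sketch below is a minimal, purely illustrative heuristic along those lines; the function names, thresholds, and the synthetic frame are assumptions for demonstration and do not represent the actual algorithm developed in the project.

```python
# Minimal sketch: locate candidate text bands in a grayscale frame by counting
# strong horizontal intensity transitions per row. A frame is a list of rows,
# each row a list of 0-255 pixel values. Illustrative only; thresholds are
# assumed, not taken from the project's system.

def row_transition_density(row, threshold=60):
    """Count strong horizontal intensity transitions in one pixel row.
    Text strokes typically produce many such transitions."""
    return sum(1 for a, b in zip(row, row[1:]) if abs(a - b) >= threshold)

def find_text_bands(image, min_density=4):
    """Return (top, bottom) row index ranges whose transition density
    suggests the presence of text."""
    bands, start = [], None
    for y, row in enumerate(image):
        dense = row_transition_density(row) >= min_density
        if dense and start is None:
            start = y                      # band begins
        elif not dense and start is not None:
            bands.append((start, y - 1))   # band ends
            start = None
    if start is not None:                  # band reaches the last row
        bands.append((start, len(image) - 1))
    return bands

# Synthetic 8-row frame: rows 2-4 contain alternating dark/bright "strokes",
# the remaining rows are flat background.
flat = [128] * 12
strokes = [0, 255] * 6
frame = [flat, flat, strokes, strokes, strokes, flat, flat, flat]
print(find_text_bands(frame))  # → [(2, 4)]
```

A production system would of course combine several such cues (edge density, stroke width, color consistency) and verify candidates with a trained classifier; this sketch only shows why transition density is an informative feature for the representation question raised above.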


[Figure: samples from real-world TV frames with detected text regions indicated by blue rectangles.]


[Figure: recognition results from real-world TV frames with detected text regions (green rectangles) and recognized text (blue).]


Parts of the developed scientific concepts are described in the book chapter "Real-Time Multimedia Policy Analysis Using Video and Audio Recognition from Radio, TV and User-Generated Content" in Advanced ICT Integration for Governance and Policy Modeling (IGI Global, 2014).