The metadata strategy behind news search service Factiva

Factiva is a news search service from Dow Jones & Company that allows users to keep track of the latest developments in their industries. It was launched in 1999 after Dow Jones and news agency Reuters decided to combine their respective “current awareness” systems to increase their global reach.

Since then, the company has evolved the service to reflect the changing information consumption habits of its target customers, and the end-user expectations set by free online search engines such as Google. Throughout, its ability to collect, analyse and exploit metadata – descriptive information about the content within the stories it aggregates – has been crucial.

The metadata component of Factiva’s technology infrastructure was originally called Intelligent Indexing. The purpose was to improve search results by identifying key concepts within a news article, including companies, subjects, regions and industries. “The goal was to provide a balance of precision and breadth in our search results,” explains Greg Merkle, creative director at Dow Jones’ Enterprise Media Group.

To build a metadata repository for the many thousands of stories it indexes every day, the company deployed text analytics and categorisation software from Inxight, a spin-off from the Xerox PARC research facility in Silicon Valley. This software would pick out the key concepts based on the syntax of the text, across multiple languages. Once the concepts mentioned in a specific story had been identified, industry-standard metadata such as D-U-N-S numbers for companies or SIC codes for industries would be associated with that story.

As text analytics technology has become more sophisticated, so too have the concepts it can be used to define. Today, Factiva can identify whether a news story relates to a change of management at a company or a bankruptcy, and can encode as much in the metadata.

Text analytics only goes so far, however, as the language of business is complex and constantly evolving. When it comes to potentially ambiguous words and phrases, the system parses the information using rules defined by a human editor. “We might need a rule to identify that a story is about Apple, for example, and not apples,” explains Merkle.
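To make the idea of an editor-defined rule concrete, here is a minimal sketch in Python. It is not Factiva's actual rule format, which the article does not describe: the cue-word list and the function name `mentions_apple_company` are assumptions made purely for illustration.

```python
import re

# Hypothetical editor-defined cue words (an assumption, not Factiva's rules):
# "Apple" counts as the company only if one of these appears in the story.
COMPANY_CUES = {"iphone", "shares", "ceo", "earnings", "cupertino"}


def mentions_apple_company(text: str) -> bool:
    """Return True if the story appears to be about Apple the company."""
    # Lower-cased word set used to look for company-related cues.
    words = set(re.findall(r"[a-z]+", text.lower()))
    # The name itself must appear capitalised as a whole word.
    has_name = re.search(r"\bApple\b", text) is not None
    return has_name and bool(words & COMPANY_CUES)


print(mentions_apple_company("Apple shares rose after strong iPhone sales."))
print(mentions_apple_company("Farmers expect a record harvest of apples."))
```

A real rule set would need far more context than a handful of cue words, but the shape is the same: a human encodes the knowledge that the text analytics cannot infer from syntax alone.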

When Factiva launched, users would have to define searches using metadata key words. Searches could be highly complex, with thousands of terms to articulate very precise results, but they could not be the simple free text search terms that are used in free web search engines.

The advent of those search engines has also had an effect on Factiva’s end user profile. Traditionally, only trained information professionals would have used the tool to do their research, but over time regular employees and executives got used to sourcing their own information.

“In 2003, we began developing an aggregation layer that would allow users to enter free text searches, and then navigate the results using the metadata,” says Merkle. “So if you searched for the New York Times, for example, it would ask you whether you meant the company or the news source. Depending what you clicked, it would then feed that back into your search.”

Navigation data

Once customers were using the metadata to navigate results, Factiva could analyse and expose the relationships between metadata concepts to suggest relevant information. For example, if a user searched for a particular company, the system could spot which other companies frequently occurred in stories with that company, and alert the user to stories about them.
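The co-occurrence idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Factiva's implementation: each story is reduced to the set of company tags in its metadata, and the function name `co_occurring_companies` and the sample company names are hypothetical.

```python
from collections import Counter


def co_occurring_companies(stories, company, top_n=3):
    """Rank the companies most often tagged in the same stories as `company`.

    `stories` is an iterable of sets of company tags, one set per story.
    """
    counts = Counter()
    for tagged in stories:
        if company in tagged:
            # Count every other company tagged on the same story.
            counts.update(c for c in tagged if c != company)
    return counts.most_common(top_n)


# Hypothetical tagged stories, for illustration only.
stories = [
    {"Acme", "Globex"},
    {"Acme", "Globex", "Initech"},
    {"Globex", "Initech"},
    {"Acme", "Umbrella"},
]
print(co_occurring_companies(stories, "Acme", top_n=1))
```

The companies at the top of the ranking are the natural candidates for a "you may also be interested in" suggestion.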

Since then, Factiva has developed more sophisticated metadata analysis to present more timely and relevant information to the users. For example, by counting the number of stories that mention a given company and spotting when there is a sharp increase in that number, it can alert users to ‘trending’ companies in their industry.
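One simple way to spot such a sharp increase is to compare the latest daily story count with a trailing average. The sketch below assumes that approach; the article does not say how Factiva defines a spike, and the window, multiplier and minimum-mention threshold are illustrative values.

```python
def is_trending(daily_counts, window=7, factor=3.0, min_mentions=5):
    """Flag a spike when the latest daily count far exceeds the recent average.

    `daily_counts` is a list of story counts per day, oldest first.
    All thresholds are assumptions chosen for illustration.
    """
    if len(daily_counts) < window + 1:
        return False  # not enough history to form a baseline
    *history, today = daily_counts[-(window + 1):]
    baseline = sum(history) / window
    # Use a floor of 1.0 so rarely mentioned companies need a real jump.
    return today >= min_mentions and today > factor * max(baseline, 1.0)


print(is_trending([2, 3, 2, 4, 3, 2, 3, 20]))
print(is_trending([2, 3, 2, 4, 3, 2, 3, 4]))
```

The minimum-mention threshold keeps a company that goes from one story to four from being flagged alongside one that goes from three stories to twenty.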

Technologies are currently emerging that promise to greatly increase Factiva users’ ability to interrogate its metadata repository, Merkle says. Schemaless or graph databases, in which entities are represented as nodes and linked together on the basis of any relationship, can support search queries that would have been too computationally intensive in a relational database.

“This means you could ask for any mention of companies that claim they are innovative in their marketing material, have a market cap above a certain level and have a given number of employees in a particular region,” he explains.
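The kind of query Merkle describes can be illustrated with a toy in-memory graph in Python. This is a sketch only: it does not use a real graph database, and the relationship names (`CLAIMS`, `EMPLOYS_IN`), property names (`market_cap`, `headcount`) and sample companies are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    props: dict = field(default_factory=dict)
    # Each edge is (relationship, target node, edge properties).
    edges: list = field(default_factory=list)

    def link(self, relationship, target, **edge_props):
        self.edges.append((relationship, target, edge_props))


def find_companies(companies, claim, min_market_cap, region, min_employees):
    """Return names of companies matching all three conditions."""
    matches = []
    for company in companies:
        makes_claim = any(
            rel == "CLAIMS" and target.name == claim
            for rel, target, _ in company.edges
        )
        staffed = any(
            rel == "EMPLOYS_IN"
            and target.name == region
            and props.get("headcount", 0) >= min_employees
            for rel, target, props in company.edges
        )
        big_enough = company.props.get("market_cap", 0) >= min_market_cap
        if makes_claim and staffed and big_enough:
            matches.append(company.name)
    return matches


# Hypothetical data, for illustration only.
innovative, emea = Node("innovative"), Node("EMEA")
acme = Node("Acme", {"market_cap": 5e9})
acme.link("CLAIMS", innovative)
acme.link("EMPLOYS_IN", emea, headcount=1200)
globex = Node("Globex", {"market_cap": 1e9})
globex.link("CLAIMS", innovative)
globex.link("EMPLOYS_IN", emea, headcount=5000)

print(find_companies([acme, globex], "innovative", 2e9, "EMEA", 1000))
```

Because any relationship can be added as a new edge type without changing a schema, conditions like these can be combined freely, which is the flexibility the graph model offers over a fixed relational layout.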

As the amount of information that Factiva can derive from its metadata increases, Merkle says, its search results are evolving beyond a simple list of relevant stories. “We are developing situational solutions for specific business cases, such as supply chain analysis: if you have a list of suppliers you rely on, you can do searches that not only show the news, but extract the facts from the stories and present them as an interactive dashboard,” he says.

Having invested in its metadata strategy for Factiva, Dow Jones looks at the new breed of emerging data management technologies with glee. “It’s kind of a playground right now,” says Merkle. “We are at a very exciting time.”

Ben Rossi
