By: Dr Fern Halper, Partner, Hurwitz & Associates
Published: 17th March 2010
Copyright Hurwitz & Associates © 2010
I had an interesting briefing with the Basis Technology team the other week. They updated me on the latest release of their technology, Rosette 7. In case you're not familiar with Basis Technology, it provides the multilingual engine that is embedded in some of the biggest Internet search engines out there, including Google, Bing, and Yahoo. Enterprises and the government also use it. But the company is not just about keyword search. Its technology also extracts entities (about 18 different kinds) such as organizations, names, and places. What does this mean? It means that the software can find these kinds of entities across massive amounts of data and perform context-sensitive discovery in many different languages.
An Example
Here's a simple example. Say you're in the Canadian consulate and
you want to understand what is being said about Canada across the
world. You type "Canada" into your search engine and get back a
listing of documents. How do you make sense of this? Using Basis
Technology entity extraction (an enhancement to search and a
basic component of text analytics), you could actually perform
faceted (i.e. guided) navigation across multiple languages. This
is illustrated in the figure below. Here, the user typed "Canada"
into the search engine and got back 89 documents. In the main
pane of the browser, an arrow highlights the word Canada in a
number of different languages, so you know that it appears in
these documents. On the left-hand side of the screen is the
guided navigation pane. For example, you can see that 15
documents contain a reference to Obama and another 6 contain a
reference to Barack Obama. This is not necessarily a
co-occurrence within a sentence, just within the document. So,
each of these articles contains a reference to both Obama and
Canada, which would help you determine what Obama might have
said about Canada, or what the connection is between Canada and
the BBC (listed under organization). This idea is not
necessarily new, but the strong multilingual capabilities make
it compelling for global organizations.
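To make the guided navigation idea concrete, here is a minimal sketch in Python of how document-level facet counts could be computed. It is not Basis Technology's API: the sample documents, the entity labels, and the function name facet_counts are all hypothetical, and the entities are hand-labelled here where a real system would get them from an entity extractor.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-labelled sample documents (an assumption for
# illustration). Each entity is a (type, name) pair.
DOCS = [
    {"text": "Obama will visit Canada next month. Obama confirmed the trip.",
     "entities": [("PERSON", "Obama"), ("LOCATION", "Canada"), ("PERSON", "Obama")]},
    {"text": "The BBC reported on trade talks between Canada and the EU.",
     "entities": [("ORGANIZATION", "BBC"), ("LOCATION", "Canada")]},
    {"text": "Barack Obama spoke to the BBC about Canada.",
     "entities": [("PERSON", "Barack Obama"), ("ORGANIZATION", "BBC"),
                  ("LOCATION", "Canada")]},
    {"text": "Le Canada accueille le sommet; Obama y participera.",
     "entities": [("PERSON", "Obama"), ("LOCATION", "Canada")]},
    {"text": "The weather in Paris is mild.",
     "entities": [("LOCATION", "Paris")]},
]


def facet_counts(docs, query):
    """For documents whose text contains the query, count how many
    documents mention each entity, grouped by entity type."""
    facets = defaultdict(Counter)
    for doc in docs:
        if query.lower() not in doc["text"].lower():
            continue  # document does not match the keyword search
        # set() so an entity repeated inside one document counts once:
        # the facet is a per-document count, not a per-mention count.
        for entity_type, name in set(doc["entities"]):
            facets[entity_type][name] += 1
    return facets


facets = facet_counts(DOCS, "Canada")
for entity_type, counts in sorted(facets.items()):
    print(entity_type, dict(counts))
```

As in the screenshot described above, "Obama" and "Barack Obama" are counted as separate facet entries, and a document counts toward a facet whenever the entity appears anywhere in it, not only in the same sentence as the query term.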
If you have eagle eyes, you will notice that the search on Canada returned 89 documents, but the entity "Canada" returned only 61. This illustrates what entity extraction is all about. When the search for Canada was run on the Rosette Name Indexer tab (see the upper right-hand corner of the screenshot), the query was matched against all automatically extracted "Canada" entities in all of the documents. This includes all persons, locations, and organizations that have similar names, such as "Canada Post" and "Canada Life", which are organizations, not the country itself. The other 28 documents (89 minus 61) therefore contain a Canada variant that is an organization or some other entity.
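The gap between a keyword hit count and an entity hit count can be shown with a small Python sketch. The documents, entity labels, and function names below are hypothetical and hand-made for illustration. They do not reproduce the 89 and 61 figures or the way the Rosette Name Indexer matches names.

```python
# Hypothetical sample data (an assumption for illustration): every text
# contains the string "Canada", but only some refer to the country.
DOCS = [
    {"text": "Canada signed the agreement.",
     "entities": [("LOCATION", "Canada")]},
    {"text": "Canada Post raised its rates.",
     "entities": [("ORGANIZATION", "Canada Post")]},
    {"text": "Canada Life reported earnings.",
     "entities": [("ORGANIZATION", "Canada Life")]},
    {"text": "Talks between Canada and Mexico resumed.",
     "entities": [("LOCATION", "Canada"), ("LOCATION", "Mexico")]},
]


def keyword_hits(docs, term):
    """Documents whose raw text contains the term (case-insensitive)."""
    return [d for d in docs if term.lower() in d["text"].lower()]


def entity_hits(docs, entity_type, name):
    """Documents in which the exact (type, name) entity was extracted."""
    return [d for d in docs if (entity_type, name) in d["entities"]]


all_hits = keyword_hits(DOCS, "Canada")
country_hits = entity_hits(DOCS, "LOCATION", "Canada")

# The difference is the set of documents where "Canada" appears only
# as part of some other entity, such as an organization name.
print(len(all_hits), len(country_hits), len(all_hits) - len(country_hits))
# prints: 4 2 2
```

In this toy data the two documents that drop out are the ones about "Canada Post" and "Canada Life", which mirrors the article's point that a keyword match alone cannot tell the country from organizations that share its name.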
Use Cases
There are obviously a number of different use cases where the
ability to extract entities across languages can be important.
Here are three:
While this technology is not a text analytics platform, it does provide an important piece of core functionality needed in a global economy. Look for more announcements from the company in 2010 around enhanced search in additional languages.
Published by: IT Analysis Communications Ltd.