Abstract:
A method of identifying topics in a corpus that includes a plurality of text-based items begins by extracting keytext from each of the plurality of text-based items, resulting in sets of keytext. The method continues by processing the keytext sets to generate a respective semantic footprint for each of the text-based items, resulting in a plurality of semantic footprints. The semantic footprints are used to calculate similarity values for the text-based items, wherein the similarity values indicate commonality between pairs of the text-based items. The method continues by clustering the text-based items into a number of topic groups, wherein the clustering is influenced by the similarity values, and by generating a topic heading for each of the number of topic groups, resulting in a number of topic headings. Next, the text-based items are grouped into accessible topic groups associated with the topic headings.
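
The steps of the method can be illustrated with a minimal sketch. The abstract does not specify how keytext is extracted, what form a semantic footprint takes, or which similarity measure and clustering algorithm are used, so everything below is an assumption chosen for illustration. Keytext is taken to be stopword-filtered word tokens, the footprint is a simple word-count vector, similarity is cosine similarity, clustering is single-link grouping over a threshold, and the topic heading is the most frequent keytext term in a group. All function names, the stopword list, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def extract_keytext(item):
    """Assumed keytext extraction: lowercase word tokens minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", item.lower()) if w not in STOPWORDS]


def footprint(keytext):
    """Stand-in for a semantic footprint: a bag-of-words count vector."""
    return Counter(keytext)


def similarity(fp_a, fp_b):
    """Cosine similarity between two footprints (0.0 if either is empty)."""
    dot = sum(fp_a[t] * fp_b[t] for t in fp_a.keys() & fp_b.keys())
    norm_a = math.sqrt(sum(v * v for v in fp_a.values()))
    norm_b = math.sqrt(sum(v * v for v in fp_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(footprints, threshold=0.3):
    """Single-link grouping: pairs at or above the threshold share a group."""
    parent = list(range(len(footprints)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(footprints)):
        for j in range(i + 1, len(footprints)):
            if similarity(footprints[i], footprints[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(footprints)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def topic_heading(group, footprints):
    """Most frequent keytext term in the group; ties broken alphabetically."""
    total = Counter()
    for i in group:
        total.update(footprints[i])
    if not total:
        return "untitled"
    return min(total, key=lambda term: (-total[term], term))


def identify_topics(items, threshold=0.3):
    """Map each generated topic heading to the items in its topic group."""
    footprints = [footprint(extract_keytext(item)) for item in items]
    topics = {}
    for group in cluster(footprints, threshold):
        heading = topic_heading(group, footprints)
        topics.setdefault(heading, []).extend(items[i] for i in group)
    return topics
```

Calling `identify_topics` on a list of strings returns a dictionary keyed by topic heading, which plays the role of the accessible topic groups. The pairwise comparison in `cluster` is quadratic in the number of items, which is acceptable for a sketch but would likely need replacing for a large corpus.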