has two associated CSS styles, font-size: large and font-style: italic. Then n = 2, and the total CSS weight of term t is calculated by multiplying the weights of these two styles. CSS style weights are taken from the interval (0, 2).
3. The triplet for a term contains words, which need to be preprocessed into terms. This preprocessing is executed in the following steps:
stop word and number removal, lemmatization, synonym replacement, stemming.
This preprocessing is necessary because the term vector uses the preprocessed terms as a link to the triplet and its information.

3.2.2 Term vector in document neighbourhood

The term vector is constructed in the following two steps:
1. First, we run the preprocessing defined above on the natural text of the document content. The result of this process is saved to the term vector (Table 2).

Table 2. Term vector of base document neighbourhood.
HTML tag weight | total term weight
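The preprocessing run in step 1, together with the CSS weight of equation (1), can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the stop word list, lemma dictionary, synonym table, suffix stemmer, and concrete style weights below are hypothetical stand-ins, since the text does not specify which resources or values are used. The names preprocess, stem, and css_weight are likewise invented for this example.

```python
import re
from math import prod

# Hypothetical resources: tiny stand-ins for the real stop word list,
# lemmatizer, synonym dictionary, and stemmer, which the text does not name.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are"}
LEMMAS = {"mice": "mouse", "automobiles": "automobile", "ran": "run"}
SYNONYMS = {"automobile": "car", "quick": "fast"}
SUFFIXES = ("ing", "ed", "es", "s")  # naive suffix stripping, not Porter

# Hypothetical style weights; the text only states they lie in (0, 2).
CSS_WEIGHTS = {"font-size: large": 1.5, "font-style: italic": 1.2}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the four preprocessing steps in the order given in the text."""
    tokens = re.findall(r"\w+", text.lower())
    # 1. stop word and number removal
    tokens = [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
    # 2. lemmatization
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # 3. synonym replacement
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    # 4. stemming
    return [stem(t) for t in tokens]


def css_weight(styles):
    """Multiply the weights of the n styles attached to a term.

    With no styles the product is 1; treating that as the neutral
    weight is an assumption of this sketch.
    """
    return prod(CSS_WEIGHTS[s] for s in styles)


if __name__ == "__main__":
    print(preprocess("Clustering of 20 quick automobiles and mice"))
    # prints ['cluster', 'fast', 'car', 'mouse']
    print(css_weight(["font-size: large", "font-style: italic"]))
```

The order of the steps matters in this sketch: lemmatization runs before synonym replacement so that "automobiles" is first reduced to "automobile" and can then be mapped to "car", and stemming runs last so that the dictionaries can be keyed on full words.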
2. In this step we compute the total term weight using the following equation:
Peter Sládeček: Article Clustering with Usage of HTML Tags
where f(h_i) is the weight of HTML tag h_i and cssw(t) is the weight of term t based on the extracted CSS style (see equation (1) for more detail). To better explain the equation, we give the following example of an HTML source: