The problem of fuzzy duplicate detection of large texts

E.V. Sharapova; R.V. Sharapov

Excerpt: In other words, the text appears as a single long string (a wide string). 3.3. Pre-processing of a text. The text is pre-processed: the text of a document is replaced by a filtered copy. For this purpose the following steps are performed: • Removal of HTML tags. A text document can contain HTML tags. Because HTML tags are used for document formatting, they do not affect its contents. Therefore the presence of HTML tags will just interfere ...
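
The HTML-tag removal step mentioned in the excerpt can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module; the helper name `strip_html_to_single_string` is hypothetical and this is not the authors' implementation in AVTOR.NET. It only shows the general idea of replacing a document with a tag-free copy held as one long string.

```python
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never stored.
        self.parts.append(data)


def strip_html_to_single_string(html_text):
    """Hypothetical helper: drop tags and collapse whitespace into one line."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join with spaces so words separated only by tags do not fuse together.
    text = " ".join(parser.parts)
    # Collapse line breaks and repeated spaces so the result is a single string.
    return re.sub(r"\s+", " ", text).strip()
```

Joining the fragments with a space is a design choice of this sketch: it avoids fusing words that are separated only by a tag, at the cost of splitting a word that has a tag in the middle. The sketch also keeps the contents of `script` and `style` elements, which a fuller filter would likely discard.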

Full metadata record

DC field	Value	Language
dc.contributor.author	E.V. Sharapova	-
dc.contributor.author	R.V. Sharapov	-
dc.date.accessioned	2018-05-22 10:06:42	-
dc.date.available	2018-05-22 10:06:42	-
dc.date.issued	2018	-
dc.identifier	Dspace\SGAU\20180518\69667	ru
dc.identifier.citation	E.V. Sharapova. The problem of fuzzy duplicate detection of large texts / E.V. Sharapova, R.V. Sharapov // Proceedings of the IV International Conference and Youth School "Information Technologies and Nanotechnologies" (ITNT-2018). - Samara: Novaya Tekhnika, 2018. - pp. 2565-2572.	ru
dc.identifier.uri	http://repo.ssau.ru/handle/Informacionnye-tehnologii-i-nanotehnologii/The-problem-of-fuzzy-duplicate-detection-of-large-texts-69667	-
dc.description	Main article	ru
dc.description.abstract	In the paper, we consider the problem of fuzzy duplicate detection. The basic approaches to detecting text duplicates are given: distance between strings, fuzzy search algorithms without data indexing, and fuzzy search algorithms with data indexing. A review of existing methods for fuzzy duplicate detection is given. An algorithm for fuzzy duplicate detection is presented. The algorithm was implemented in the system AVTOR.NET. The use of text filtering, stemming and character replacement allows the algorithm to find duplicates even in slightly modified texts.	ru
dc.language.iso	en_US	ru
dc.publisher	Novaya Tekhnika	ru
dc.subject	fuzzy duplicate detecting	ru
dc.subject	fuzzy duplicate	ru
dc.subject	text	ru
dc.title	The problem of fuzzy duplicate detection of large texts	ru
dc.type	Article	ru
dc.textpart	In other words, the text appears as a single long string (a wide string). 3.3. Pre-processing of a text. The text is pre-processed: the text of a document is replaced by a filtered copy. For this purpose the following steps are performed: • Removal of HTML tags. A text document can contain HTML tags. Because HTML tags are used for document formatting, they do not affect its contents. Therefore the presence of HTML tags will just interfere ...	-
Appears in collections:	Information Technologies and Nanotechnologies

Files in this item:

File	Description	Size	Format
The problem of fuzzy duplicate detection of large texts.pdf	Main article	168.43 kB	Adobe PDF


All items in the electronic resource archive are protected by copyright; all rights reserved.

Samara University Repository