Follow me on Twitter to receive updates, free scripts and ongoing announcements!

Remove duplicates from file

Written by Codes Tips on March 27, 2009 – 8:33 am -

I will show you how you can remove word duplicates form a file with an efficient algorithm. Say you are having a large file with words that are line by line in a file on your disk. If you want to remove word or phrases duplicates from file you should first read the entire file into an array of strings, sort it and after that remove duplicate phrases and write back to the file.

Why we sort the array?

Because if we would not sort the array of strings then the complexity of a normal algorithm like, taking each phrase and compare to all other phrases, will have the O(N^2) complexity.

Tip: you should use QuickSort algorithm because if you have over 5000 phrases or words, the application will freeze. The complexity of the QuickSort algorithm is O(n*log(n)), so if you have saying 100.000 phrases, the complexity of the QuickSort will be 5000*log(5000) which is lower then 5000*5000, that a normal sorting algorithm uses to sort an array.

After sorting the array you should go through the array from the first phrase to the last element and compare on each step if the previous phrase is equal with the current phrase. If the phrase is not equal we write it to the output file.

This is the description of an efficient remove duplicates phrases algorithm.This function will work even for 1 million phrases or words. Here is the source code of the procedure:

 
 
procedure RemoveDuplicates(pathtofile:string);
var f:textfile;
    phrases:array[1..1000000] of string;
    nr,i,nrduplicates:integer;
    exists:boolean;
    s:string;
begin
   nrduplicates:=0;
   nr:=0;
   exists:=false;
      assignfile(f,pathtofile);
      reset(f);
        while not eof(f) do
          begin
            inc(nr);
            readln(f,phrases[nr]);
          end;
      closefile(f);
     QuickSort(phrases,1,nr);
     assignfile(f,pathtofile);
     rewrite(f);
     s:='';
      for i:=1 to nr do
        if strIcomp(pchar(s),pchar(phrases[i]))<>0 then
         begin
           writeln(f,phrases[i]);
           s:=phrases[i];
         end
         else
           inc(nrduplicates);
     closefile(f);
   showmessage(inttostr(nrduplicates)+' duplicates removed');
end;

Tags: , ,
Posted in Algorithms, Codes, Delphi |

Leave a Comment

You must be logged in to post a comment.


Plugintaylor.com - Plugintaylor and Custom Field