Jump to content

de-duping Search Results


Lukey_luke
 Share

Recommended Posts

Hi all,

 

Sometimes when we run searches, we need to de-dup the results at a file level. So if an attachment was attached to three different emails, we want to be able to de-dup the so we only see that attachment once.

Are workflow has been to export the results, to excel (fields BEGDOC and MD5) and remove duplicates and then seach for the BEGDOCS, but some of our databases are getting big and searching for 10k BEGDOCs at a time is too slow.

 

Ideally, we would just want to right click the search, select de-dup and have options such as "at Family Level" or "at File Level", then the result is split into two folders. Or even just have a tag group created based on the search name, then the two sub tags with the dups and non-dups tagged up.

 

Anyone got any tips?

 

Thanks,

Lukeyluke

Link to comment
Share on other sites

Hi all,

 

Sometimes when we run searches, we need to de-dup the results at a file level. So if an attachment was attached to three different emails, we want to be able to de-dup the so we only see that attachment once.

Are workflow has been to export the results, to excel (fields BEGDOC and MD5) and remove duplicates and then seach for the BEGDOCS, but some of our databases are getting big and searching for 10k BEGDOCs at a time is too slow.

 

Ideally, we would just want to right click the search, select de-dup and have options such as "at Family Level" or "at File Level", then the result is split into two folders. Or even just have a tag group created based on the search name, then the two sub tags with the dups and non-dups tagged up.

 

Anyone got any tips?

 

Thanks,

Lukeyluke

Link to comment
Share on other sites

  • 2 weeks later...
  • IPRO Employee

Hello Lukeyluke,

 

Thank you for posting in the Community.

 

Since Eclipse does not have straight deduplication functionality, the method you are using which is calculating the originals based on data in Excel, would be the best.

 

In this logic, it sounds like we are selecting only unique values for MD5Hash, and assuming that the first BegDoc that appears with the MD5Hash value is the original.

 

If this is becoming cumbersome in Excel and is taking awhile, there is an alternate solution using SQL Server that we can:

1. Group by MD5Hash value

2. Take the first DocId that appears with that MD5Hash value

3. Tag the document as original.

 

In this instance, you could then run a search on the 'Original' tag, whether it be equal to or not equal to.

 

To do this, follow the steps below.

  1. In the Eclipse case, create a tag that will be used to tag the Original documents.
  2. Access SQL Server Management Studio, and find the name of the Eclipse database.

    1. The database name can be found by creating a new query on the Nucleus/ADD database, and running the following query:
    2. SELECT Name, DatabaseName FROM Enterprise.CaseProductEnvironment WHERE ProductId = 3


[*]Once you have the database name, find it in the Object Explorer, right click the database, and click 'Create New Query'.

[*]Next, we will need to determine the unique ID number that is assigned to the Document Tag that we created in step 1.

  1. The TagID can be found by running the following query:
  2. SELECT TagId, TagName FROM dbo.Tags


[*]When you have located the TagId, you can then create a new query window and run the following query:

  1. DECLARE @EclipseDatabaseName nvarchar(max)
    	DECLARE @TagId nvarchar(max)
    	DECLARE @SQL nvarchar(max)
    
    	SET @EclipseDatabaseName = 'DBNAMEHERE' -- Eclipse Database name here
    	SET @TagId = IDHERE -- TagId here
    
    	SET @SQL = 'USE ' + @EclipseDatabaseName + '
    	IF OBJECT_ID(''TEMPDB..#InsertTags'') IS NOT NULL DROP TABLE #InsertTags
    	SELECT min(DocId) as DocId, ' + @TagId + ' AS TagID INTO #InsertTags 
    	FROM vDocumentFields
    	GROUP BY MD5HASH
    
    	INSERT INTO dbo.DocumentTags
    	SELECT * FROM #InsertTags
    	WHERE DocId NOT IN (SELECT DocId FROM dbo.DocumentTags WHERE TagID = ' + @TagId + ')
    
    	DROP TABLE #InsertTags'
    
    	exec (@SQL)
    


  2. Please Note: You will need to replace the information between the single quotes SET @EclipseDatabaseName = 'DBNAMEHERE' to the Eclipse Case database name that we retrieved in step 2. Additionally, this will also need to be done for SET @TagId = IDHERE. Please replace the IDHERE with the correct TagId.

[*]The query will have tagged the documents that have a DocId of the lowest value, grouped by MD5Hash value.

Please let me know if the logic you are using on your Excel calculations differ, and we can adjust the query.

 

 

 

 

  • Like 1
Link to comment
Share on other sites

  • IPRO Employee

Hello Lukeyluke,

 

Thank you for posting in the Community.

 

Since Eclipse does not have straight deduplication functionality, the method you are using which is calculating the originals based on data in Excel, would be the best.

 

In this logic, it sounds like we are selecting only unique values for MD5Hash, and assuming that the first BegDoc that appears with the MD5Hash value is the original.

 

If this is becoming cumbersome in Excel and is taking awhile, there is an alternate solution using SQL Server that we can:

1. Group by MD5Hash value

2. Take the first DocId that appears with that MD5Hash value

3. Tag the document as original.

 

In this instance, you could then run a search on the 'Original' tag, whether it be equal to or not equal to.

 

To do this, follow the steps below.

  1. In the Eclipse case, create a tag that will be used to tag the Original documents.
  2. Access SQL Server Management Studio, and find the name of the Eclipse database.

    1. The database name can be found by creating a new query on the Nucleus/ADD database, and running the following query:
    2. SELECT Name, DatabaseName FROM Enterprise.CaseProductEnvironment WHERE ProductId = 3


[*]Once you have the database name, find it in the Object Explorer, right click the database, and click 'Create New Query'.

[*]Next, we will need to determine the unique ID number that is assigned to the Document Tag that we created in step 1.

  1. The TagID can be found by running the following query:
  2. SELECT TagId, TagName FROM dbo.Tags


[*]When you have located the TagId, you can then create a new query window and run the following query:

  1. DECLARE @EclipseDatabaseName nvarchar(max)
    	DECLARE @TagId nvarchar(max)
    	DECLARE @SQL nvarchar(max)
    
    	SET @EclipseDatabaseName = 'DBNAMEHERE' -- Eclipse Database name here
    	SET @TagId = IDHERE -- TagId here
    
    	SET @SQL = 'USE ' + @EclipseDatabaseName + '
    	IF OBJECT_ID(''TEMPDB..#InsertTags'') IS NOT NULL DROP TABLE #InsertTags
    	SELECT min(DocId) as DocId, ' + @TagId + ' AS TagID INTO #InsertTags 
    	FROM vDocumentFields
    	GROUP BY MD5HASH
    
    	INSERT INTO dbo.DocumentTags
    	SELECT * FROM #InsertTags
    	WHERE DocId NOT IN (SELECT DocId FROM dbo.DocumentTags WHERE TagID = ' + @TagId + ')
    
    	DROP TABLE #InsertTags'
    
    	exec (@SQL)
    


  2. Please Note: You will need to replace the information between the single quotes SET @EclipseDatabaseName = 'DBNAMEHERE' to the Eclipse Case database name that we retrieved in step 2. Additionally, this will also need to be done for SET @TagId = IDHERE. Please replace the IDHERE with the correct TagId.

[*]The query will have tagged the documents that have a DocId of the lowest value, grouped by MD5Hash value.

Please let me know if the logic you are using on your Excel calculations differ, and we can adjust the query.

 

 

 

 

  • Like 1
Link to comment
Share on other sites

  • IPRO Employee

Hello Lukeyluke,

 

Thank you for posting in the Community.

 

Since Eclipse does not have straight deduplication functionality, the method you are using which is calculating the originals based on data in Excel, would be the best.

 

In this logic, it sounds like we are selecting only unique values for MD5Hash, and assuming that the first BegDoc that appears with the MD5Hash value is the original.

 

If this is becoming cumbersome in Excel and is taking awhile, there is an alternate solution using SQL Server that we can:

1. Group by MD5Hash value

2. Take the first DocId that appears with that MD5Hash value

3. Tag the document as original.

 

In this instance, you could then run a search on the 'Original' tag, whether it be equal to or not equal to.

 

To do this, follow the steps below.

  1. In the Eclipse case, create a tag that will be used to tag the Original documents.
  2. Access SQL Server Management Studio, and find the name of the Eclipse database.

    1. The database name can be found by creating a new query on the Nucleus/ADD database, and running the following query:
    2. SELECT Name, DatabaseName FROM Enterprise.CaseProductEnvironment WHERE ProductId = 3


[*]Once you have the database name, find it in the Object Explorer, right click the database, and click 'Create New Query'.

[*]Next, we will need to determine the unique ID number that is assigned to the Document Tag that we created in step 1.

  1. The TagID can be found by running the following query:
  2. SELECT TagId, TagName FROM dbo.Tags


[*]When you have located the TagId, you can then create a new query window and run the following query:

  1. DECLARE @EclipseDatabaseName nvarchar(max)
    	DECLARE @TagId nvarchar(max)
    	DECLARE @SQL nvarchar(max)
    
    	SET @EclipseDatabaseName = 'DBNAMEHERE' -- Eclipse Database name here
    	SET @TagId = IDHERE -- TagId here
    
    	SET @SQL = 'USE ' + @EclipseDatabaseName + '
    	IF OBJECT_ID(''TEMPDB..#InsertTags'') IS NOT NULL DROP TABLE #InsertTags
    	SELECT min(DocId) as DocId, ' + @TagId + ' AS TagID INTO #InsertTags 
    	FROM vDocumentFields
    	GROUP BY MD5HASH
    
    	INSERT INTO dbo.DocumentTags
    	SELECT * FROM #InsertTags
    	WHERE DocId NOT IN (SELECT DocId FROM dbo.DocumentTags WHERE TagID = ' + @TagId + ')
    
    	DROP TABLE #InsertTags'
    
    	exec (@SQL)
    


  2. Please Note: You will need to replace the information between the single quotes SET @EclipseDatabaseName = 'DBNAMEHERE' to the Eclipse Case database name that we retrieved in step 2. Additionally, this will also need to be done for SET @TagId = IDHERE. Please replace the IDHERE with the correct TagId.

[*]The query will have tagged the documents that have a DocId of the lowest value, grouped by MD5Hash value.

Please let me know if the logic you are using on your Excel calculations differ, and we can adjust the query.

 

 

 

 

  • Like 1
Link to comment
Share on other sites

  • IPRO Employee

As a side note, this is based on DocId. DocId is populated based on the order the documents are brought in. If there was a mix up of load files, such as BegDoc 0001 starts in Load file 1, BegDoc 0010 starts in Load file 2, and BegDoc 0020 start in Load file 3 -- and the loadfiles were loaded in the order of: Loadfile3, Loadfile1, Loadfile2, the BegDocs would not match the order of the DocIds.

Link to comment
Share on other sites

  • IPRO Employee

As a side note, this is based on DocId. DocId is populated based on the order the documents are brought in. If there was a mix up of load files, such as BegDoc 0001 starts in Load file 1, BegDoc 0010 starts in Load file 2, and BegDoc 0020 start in Load file 3 -- and the loadfiles were loaded in the order of: Loadfile3, Loadfile1, Loadfile2, the BegDocs would not match the order of the DocIds.

Link to comment
Share on other sites

  • IPRO Employee

As a side note, this is based on DocId. DocId is populated based on the order the documents are brought in. If there was a mix up of load files, such as BegDoc 0001 starts in Load file 1, BegDoc 0010 starts in Load file 2, and BegDoc 0020 start in Load file 3 -- and the loadfiles were loaded in the order of: Loadfile3, Loadfile1, Loadfile2, the BegDocs would not match the order of the DocIds.

Link to comment
Share on other sites

  • S.W.A.T. Engineer

This is fantastic information, Cody!

 

It is worth a further note, however, that as Cody mentioned, since Eclipse does not have any built-in deduplication functionality, the methods described above will not provide results that are as comprehensive as performing deduplication in a tool such as eCapture. We also would not recommend removing any documents based on the results from the operations above, but using the results as an aid to assist the review process.

Link to comment
Share on other sites

  • S.W.A.T. Engineer

This is fantastic information, Cody!

 

It is worth a further note, however, that as Cody mentioned, since Eclipse does not have any built-in deduplication functionality, the methods described above will not provide results that are as comprehensive as performing deduplication in a tool such as eCapture. We also would not recommend removing any documents based on the results from the operations above, but using the results as an aid to assist the review process.

Link to comment
Share on other sites

  • S.W.A.T. Engineer

This is fantastic information, Cody!

 

It is worth a further note, however, that as Cody mentioned, since Eclipse does not have any built-in deduplication functionality, the methods described above will not provide results that are as comprehensive as performing deduplication in a tool such as eCapture. We also would not recommend removing any documents based on the results from the operations above, but using the results as an aid to assist the review process.

Link to comment
Share on other sites

  • 2 weeks later...
  • 1 month later...
  • 5 months later...

We use the results of the duplicateof and duplicate locations fields generated from Allegro or Ecapture in combination with md5 hash. We typically do not delete attachments from families as a rule, however, we do identify emails that have duplicate attachments for the purposes of determining which email to produce in it's purest form (we try to avoid using emails that typically include other emails with attachments).

these fields operate on a doc level and not family level.

using this approach in combination with CAAT unique and redundant email threads and near duplicates searches allows us to filter down our collection prior to review.

Link to comment
Share on other sites

We use the results of the duplicateof and duplicate locations fields generated from Allegro or Ecapture in combination with md5 hash. We typically do not delete attachments from families as a rule, however, we do identify emails that have duplicate attachments for the purposes of determining which email to produce in it's purest form (we try to avoid using emails that typically include other emails with attachments).

these fields operate on a doc level and not family level.

using this approach in combination with CAAT unique and redundant email threads and near duplicates searches allows us to filter down our collection prior to review.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...