Sharmarke Afgab

Administrators
  • Posts: 252
  • Days Won: 7

Everything posted by Sharmarke Afgab

  1. Welcome to Pages! Pages extends your site with custom content management designed especially for communities. Create brand new sections of your community using features like blocks, databases and articles, pulling in data from other areas of your community. Create custom pages in your community using our drag'n'drop, WYSIWYG editor. Build blocks that pull in all kinds of data from throughout your community to create dynamic pages, or use one of the ready-made widgets we include with the Invision Community. View our Pages documentation
  2. Welcome to your new Invision Community! Take some time to read through the Getting Started Guide and Administrator Documentation. The Getting Started Guide will walk you through some of the necessary steps to setting up your community. The Administrator Documentation takes you through the details of the capabilities of our platform. Go to the Documentation
  3. As you know, for non-emails we hash the raw bits of the file itself. Each file contains blocks of data, and we use that data to generate the file's hash. Emails, however, are a different animal. If I send an email to Chuck, there will be two copies of that email: the first resides in my "Sent" folder, and the second resides in Chuck's inbox. If both Chuck and I are Custodians in a case, our data will get collected, and those two emails should be considered duplicates. However, emails contain something called an internet header, which records information about the email servers the message passed through to reach the recipient. In this example, the email in Chuck's inbox will contain information that is not in the email in my "Sent" folder. Because of this little-known internet header, two copies of the same email will never hash the same using the standard file-hashing method. That is why we choose specific email fields and the body to generate the hash for emails.
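     To make this concrete, here is a minimal Python sketch of field-based email hashing. The field list, separator, and function name are illustrative assumptions, not eCapture's actual implementation:

     import hashlib

     def email_hash(subject, sender, sent_date_utc, attachment_count, body):
         # Assumption: the chosen fields are joined with a fixed separator
         # before hashing; the product's field order and delimiter may differ.
         parts = [subject, sender, sent_date_utc, str(attachment_count), body]
         canonical = "\x1f".join(parts)
         return hashlib.md5(canonical.encode("utf-8")).hexdigest()

     # Both copies hash identically because the internet header is excluded:
     sent = email_hash("Status", "me@example.com", "2015-01-05T14:00:00Z", 0, "Hi Chuck")
     inbox = email_hash("Status", "me@example.com", "2015-01-05T14:00:00Z", 0, "Hi Chuck")
     assert sent == inbox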
  4. By: Chuck Holtzworth
     In most cases (projects), MD5 hash values are calculated on the file itself. For more reliable de-duplication of emails, though, de-duplication must occur on the information contained within the email and not on the file itself. There are many reasons for this; the simplest is that when an email is saved out of its container (PST, NSF, etc.), the file that is created contains information that would change the hash value of the same email each time it was saved out. Hashing of e-mails uses the UTC time to ensure proper de-duplication across time zones.
     When an email is discovered within Ipro eCapture, it is assigned a hash value based on fields chosen by the user. The values of these fields are concatenated and the text is hashed. Select from the following email fields to generate the hash value:
     • Subject
     • From/Author
     • Attachment Count
     • Body Whitespace: From the Body Whitespace drop-down list, select either Include (default) or Remove. Whitespace in the e-mail body could cause slight differences between the same e-mails, which could result in different hashes being generated. Remove strips all whitespace between lines of text in the e-mail body prior to hashing; Include keeps the whitespace.
     • E-mail Date: The following message types use the specified date values. Outlook: Sent Date; IBM (formerly Lotus) Notes: Posted Date; RFC822: Date; GroupWise: Delivered Date.
     • Attachment Names
     • Recipients
     • CC
     • BCC
     Select either Creation Date or Last Modification Date. The selected value is used when calculating the MD5 hash in the event that the normal E-mail Date value is not present, which commonly occurs for Draft messages that have not been sent. Start Time is always used if it exists.
     DataExtract/Processing: A comparison is made on the SHA1 hash value for parents and standalone documents at the time of de-duplication. If a match is found on a parent email, a family hash is generated using the MD5 values for the entire family. These are concatenated and hashed, resulting in an MD5 family hash. The resulting family hash is then used for the de-duplication comparison. The comparison set of documents for de-duplication is those found within the DataExtractResults\Results tables in the client database, depending on the de-duplication settings specified via Flex Processor.
     • The family hash value is stored in the ClientDatabase.Items.FamilyHash field.
     • The family hash is generated in SQL code.
     Streaming Discovery: A family hash is generated for all email parents where the Items.EmailAttachmentCount field is > 1. The family hash is always generated during Streaming Discovery just before an export set is created, either by export interval or by manually exporting the documents. The MD5 values are concatenated and hashed, resulting in a SHA1 family hash. The resulting family hash is then used for the de-duplication comparison. The comparison set of documents for de-duplication is those found within the ExportSetSelection table in the client database.
     • The family hash value is stored in the ClientDatabase.Items.FamilyHash field.
     • The family hash is generated using the stored procedure named 'CalculateFamilyHash' found in each client database.
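     The family-hash step can be sketched the same way: the member MD5 values are concatenated and hashed again. The member ordering and the lack of a delimiter below are assumptions; in the product this logic runs in SQL (the 'CalculateFamilyHash' stored procedure):

     import hashlib

     def family_hash(member_md5s):
         # Assumption: members are concatenated in document order with no
         # delimiter; the stored procedure may order or join them differently.
         concatenated = "".join(member_md5s)
         return hashlib.md5(concatenated.encode("ascii")).hexdigest()

     family = [
         "9e107d9d372bb6826bd81d3542a419d6",  # parent email MD5
         "e4d909c290d0fb1ca068ffaddf22cbd0",  # attachment MD5
     ]
     print(family_hash(family))  # one hash representing the whole family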
  5. Hello everyone, I wanted to post this email and a portion of the thread between Chuck Holtzworth (Sr. SWAT Engineer) and myself discussing hashing. I feel like this is a topic not discussed too often or in detail. Chuck captures our hashing logic in a way that we can all understand. Please stay and read the below thread.
  6. Hey Stephen, Are you still having this issue? I tried recreating this in my environment and was unable to. Please let me know so we can look into it.
  7. Avi, One other item that can contribute to count differences is the method of extraction of embedded items between the two engines. For example, standard discovery extracts more embedded file types at the top level, while streaming goes deeper and extracts within several layers of embedded files. The two methods are mostly similar in file type support but differ in certain extraction methods.
  8. KKuntamukkala, The short answer is as follows: the higher you set the OCR confidence, the more documents will be flagged for QC, and manual QC and OCR will be required after the job completes. The lower the setting, the more OCR will be attempted on documents during the job. Below is a link to an article explaining this setting further: https://community.iprotech.com/articles/eclipse-aa/ecapture-aa/554-ecapture-ocr-confidence
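     The tradeoff reads as a simple threshold check. This Python sketch is illustrative only, not eCapture's actual logic:

     def needs_manual_qc(engine_confidence: int, confidence_setting: int) -> bool:
         # A higher confidence setting means more pages fall below it,
         # so more documents get flagged for manual QC after the job.
         return engine_confidence < confidence_setting

     print(needs_manual_qc(82, 90))  # True  -> flagged for manual QC
     print(needs_manual_qc(82, 75))  # False -> OCR output accepted during the job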
  9. Great post Josiah. So the question is, what is a good size for this limit? How do you decide a document is just too big for the viewer and should instead be reviewed natively? There is never a perfect answer, but you can arrive at a good size limit by considering the following facts:
     1 character = 1 byte
     1,000 characters = 1 KB
     1 page of text typically holds 2 KB of text
     Thus it takes about 500 pages to make 1 MB, which means it takes about 5,000 pages to make up 10 MB.
     5,000 pages? Is that a lot? Depends on how good of a reader you are. If you're the type that gets through five large novels a day, then it should be just fine, as a good paperback sci-fi novel runs anywhere between 300 and 500 pages. Does that seem overwhelming? Consider this: the Bible is about 4 MB and 1,200 pages. So yeah, 10 MB is just about 2.5 Bibles.
     Yes, but computers handle things better, and it's technology, so what is the big deal? Well, downloading something of that size takes longer and slows down your review. Here is my advice: set your size limit somewhere between 5 and 7 MB, and anything above that should be put in a smartfolder for native review. In fact, you should call the smartfolder "Review in Native".
     You may be wondering why I just stated the size of a Bible and then advised you to view anything larger than a Bible natively. The reason is that when discussing size in characters, pages, bytes, and KB, we're only considering plain text, not pictures and all the formatting that comes with Microsoft Office documents. For example, if you type up a 500-page essay in MS Word, it will probably be more than the 1 MB my estimates above suggest; it will probably land somewhere around 3-4 MB, because MS Word adds a ton of formatting overhead to the file. You know how complicated MS Word can get with all its hidden features. Hence my advice to set the limit somewhere between 5 and 7 MB. And that is all I have to say about size limits for the native viewer.
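     The back-of-the-envelope math above in a few lines of Python, using the same assumptions (1 byte per character, roughly 2 KB of plain text per page):

     BYTES_PER_PAGE = 2_000      # ~2 KB of plain text per page
     BYTES_PER_MB = 1_000_000

     def estimated_pages(size_mb: int) -> int:
         # Plain-text page estimate; real Office files carry formatting overhead.
         return size_mb * BYTES_PER_MB // BYTES_PER_PAGE

     print(estimated_pages(1))    # 500 pages
     print(estimated_pages(10))   # 5000 pages -- roughly 2.5 Bibles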
  10. John, There is currently a defect in the PDF family export function, and our Dev team is looking into it. Your ticket should be updated with a Development ticket ID and further instructions on the next steps. Meanwhile, the response from the Support Engineer is a suggested workaround. I understand the urgency is not well conveyed in the ticket response, but I can assure you there is urgency to get this issue resolved on our end. Please reach out to Support if you have further questions.
  11. Hey John, It seems like something may have gone haywire with the export tables in SQL. This is one of those issues where you would have to call Ipro Support so we can see where the issue is stemming from. Thanks for the heads up.
  12. Sometimes you run out of SQL storage space, and shrinking your databases is the tempting, easy option. While this practice addresses your immediate needs on a small scale, it eventually becomes a zero-sum game. Shrinking a data file causes index fragmentation, takes up resources, and ultimately leads to performance issues. It should not be part of your regular maintenance exercise, as you get into a vicious shrink/grow cycle. If you shrink and you have log backups as part of your maintenance plan, this activity is backed up and thus takes up unnecessary space on your backup drive. As you process data or users perform additional work in your review case, the database will auto-grow, which leads you back to shrinking again. My advice: allocate the proper amount of storage based on usage, add more storage when needed, and STOP shrinking.
  13. Hey Orlando, Sorry for the late response to this email, as I was tied up in training today. The most logical route would be to run "near deduplication" on this subset of data so you can identify similarity percentages. Once completed, you can construct a search similar to the one below:
     CAATNDSort 'is not' 1 AND CAATNDSimilarity = 99
     This method, for the most part, ensures documents that should have been deduplicated get isolated.
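     For a feel of what a similarity percentage like 99 means, here is a generic Python sketch; difflib is a stand-in for illustration, not the engine the product uses:

     import difflib

     def similarity_pct(doc_a: str, doc_b: str) -> int:
         # Percentage of matching content between two extracted texts.
         return round(difflib.SequenceMatcher(None, doc_a, doc_b).ratio() * 100)

     a = "Please find the attached quarterly report for review."
     b = "Please find the attached quarterly report for review. Thanks!"
     print(similarity_pct(a, b))  # high, but below 100 -- a near-duplicate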
  14. Hey Wade, Passwords should be entered consecutively as a list, one password per line. No delimiter is required, and a CSV import is not necessary.