You mention in your article using tools to help clean data. Are there any pitfalls to using certain tools? If so, how can one remedy or prevent them?
There can be pitfalls to using any tool. The biggest pitfall I've encountered is not understanding how the tool works.
Whether you are using regular expressions in Excel through Visual Basic for Applications (VBA) or tools from the tidyverse in the R programming language, if you do not understand how the tools work (what inputs they expect, in what format, etc.) then you can easily make a mistake. For example, the LEFT() function in an Excel spreadsheet may work fine as long as all of the target data is the same length. If, on the other hand, you are expecting only 6-digit part numbers and a few are 7 digits, taking the first 6 characters silently drops the last digit, and you may misinterpret your results simply because the extracted values are not accurate. I currently use model numbers that are 3 or 4 digits long. I am getting ready to add some new models that will increase the length to 5 for some products. If I do not account for that in my searches and analyses, I could end up evaluating an old-style product when I meant to evaluate the new release.
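To make the fixed-length pitfall concrete, here is a minimal sketch in Python (not Excel or R, just for illustration). The part records and the helper names left() and leading_digits() are made up for this example; left() only mimics what LEFT(text, 6) does, and the regular expression is one possible way to avoid assuming a length.

```python
import re


def left(text: str, n: int) -> str:
    """Mimic Excel's LEFT(text, n): return the first n characters."""
    return text[:n]


def leading_digits(text: str) -> str:
    """Return the whole run of digits at the start of text, whatever its length."""
    match = re.match(r"\d+", text)
    return match.group(0) if match else ""


# Hypothetical records: a part number followed by a suffix.
records = ["123456-A", "654321-B", "1234567-C"]

# Fixed-width extraction assumes every part number is exactly 6 digits.
fixed = [left(r, 6) for r in records]

# Pattern-based extraction makes no assumption about the length.
flexible = [leading_digits(r) for r in records]

# Records where the fixed-width assumption changed the value.
mismatches = [r for r, f, x in zip(records, fixed, flexible) if f != x]

print(fixed)
print(flexible)
print(mismatches)
```

In this made-up data the 7-digit part is cut down to the same 6 characters as the first record, so two different parts would look identical in any later analysis. Comparing the two extractions, as the mismatches list does, is a cheap check to run before trusting a fixed-length rule.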