No, you are not alone. Google also made this big mistake with AI


Just last month, an article made the rounds suggesting that more than 30 percent of the data Google used for one of its shared machine learning models was mislabeled. Not only was the model itself full of errors, but the training data behind it was too. How could anyone using Google’s model hope to trust the results if they are riddled with human errors that computers can’t fix? And Google isn’t the only company with significant data mislabeling: 2021 MIT research found that nearly 6 percent of images in the industry-standard ImageNet database were mislabeled, in addition to “label errors in the test set of the 10 most commonly used computer vision, natural language, and audio datasets.” How can we hope to trust or use these models if the data used to train them is so bad?

The answer is that you can’t trust the data or those models, at least not blindly. As AI develops, garbage in is absolutely garbage out, and AI projects are suffering from seriously bad data. If Google, ImageNet, and others made this mistake, chances are you have made it too. Cognilytica research shows that more than 80% of AI project time is spent managing data, from collecting and aggregating it to cleaning and labeling it. Even with that much time invested, errors are bound to occur, and that assumes the data was of good quality in the first place. Bad data equals bad results. This has been true of all kinds of data-oriented projects for decades, and now it is a major issue for AI projects, which are, at their core, big data projects.

Data quality is more than just “bad data”

Data is at the heart of artificial intelligence. It is not program code that drives AI and ML projects, but the data from which learning must be derived. Too often, organizations’ AI projects move so quickly that teams only realize later that poor data quality is causing their AI systems to fail. If your data is of poor quality, don’t be surprised when your AI project suffers.

Data quality is more than just “bad data” such as incorrect data labels, missing or wrong data points, noisy data, or low-quality images. Major data quality issues also arise when you acquire or merge datasets, and again when data is captured and augmented with third-party datasets. Each of these operations, and many others, introduces potential sources of data quality problems.

So how do you assess the state of your data before starting an AI project? It’s important to pre-assess your data rather than move forward only to realize, too late, that you don’t have the high-quality data your project needs. Teams need to identify their data sources, such as streaming data, customer data, or third-party data, and then figure out how to successfully merge and combine data from these disparate sources. Unfortunately, most data is not in a good, usable state. You need to remove extraneous, incomplete, duplicated, or otherwise unusable data. You also need to filter this data to help reduce bias.
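As a rough illustration of that pre-assessment step, a sketch like the following can surface incomplete and duplicated records before any modeling begins. This is only a sketch, not any tool the article refers to, and the record fields and function name are hypothetical:

```python
def assess_and_clean(records, required_fields):
    """Drop records that are incomplete or exact duplicates on the
    required fields, and report simple quality statistics."""
    seen = set()
    clean = []
    dropped_incomplete = 0
    dropped_duplicate = 0
    for rec in records:
        # Incomplete: any required field missing or empty.
        if any(rec.get(f) in (None, "") for f in required_fields):
            dropped_incomplete += 1
            continue
        # Duplicate: identical values across all required fields.
        key = tuple(rec[f] for f in required_fields)
        if key in seen:
            dropped_duplicate += 1
            continue
        seen.add(key)
        clean.append(rec)
    report = {
        "kept": len(clean),
        "dropped_incomplete": dropped_incomplete,
        "dropped_duplicate": dropped_duplicate,
    }
    return clean, report
```

In practice a team would extend checks like these with domain-specific rules (valid ranges, referential integrity, bias-related filters), but even a report this simple makes the state of a dataset visible before money is spent on modeling.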

But we’re not done yet. You also need to consider how the data must be transformed to meet your specific requirements. How will you implement data cleansing, data transformation, and data manipulation? Not all data is created equal, and you will experience data decay and data drift over time.

Have you considered how to monitor this data and evaluate it to ensure the quality remains at the level you need? If you need labeled data, how will you get it? There are also possible data augmentation steps to consider, and if you do additional augmentation, how will you monitor it? Yes, quality data involves many steps, and all of these are aspects you need to consider to make your project a success.
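One common way teams monitor for the data drift mentioned above is the population stability index (PSI), which compares the distribution of a feature today against a baseline sample. The sketch below is a minimal pure-Python version, not the article’s method; the ~0.2 threshold is a widely used rule of thumb, not a universal constant:

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Bin the baseline ('expected') sample of a numeric feature and
    measure how far the current ('actual') sample's distribution has
    shifted.  PSI above roughly 0.2 is often read as meaningful drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1
        total = len(sample)
        # Smooth empty bins so the logarithm below is defined.
        return [(c or 0.5) / total for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Running a check like this on a schedule, and alerting when the index crosses the chosen threshold, is one concrete answer to the “how will you monitor it?” question.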

Specifically, data labeling is a common area where many teams struggle. For supervised learning to work, the system needs good, clean, well-labeled data so that it can learn from examples. If you’re trying to identify images of ships in the ocean, you need to feed the system good, clean, well-labeled images of ships to train your model on. That way, when you feed it an image it has never seen before, it can tell you with a high degree of certainty whether there is a ship in the image. But if you only train your system on images of a boat in the ocean on a sunny day with no cloud cover, how will it react to a boat at night, or a boat under 50 percent cloud cover? If your training data doesn’t match real-world data or real-world scenarios, then you have a problem.
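A cheap sanity check for that kind of mismatch is to compare the capture conditions tagged in real-world data (night, cloud cover, and so on) against their share of the training set. The following is only an illustrative sketch; the tag names and the 5 percent threshold are assumptions for the example, not from the article:

```python
from collections import Counter


def underrepresented_conditions(train_tags, real_tags, min_share=0.05):
    """Flag capture conditions that appear in real-world data but make
    up less than `min_share` of the training set.  Returns a list of
    (condition, training_share) pairs worth collecting more data for."""
    train_counts = Counter(train_tags)
    n = len(train_tags)
    flagged = []
    for tag in sorted(set(real_tags)):
        share = train_counts.get(tag, 0) / n
        if share < min_share:
            flagged.append((tag, share))
    return flagged
```

If “night” or “50% cloud cover” comes back flagged, you know before deployment that the model has barely seen those scenarios, which is exactly the gap the sunny-day ship example describes.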

Even when teams spend a lot of time making sure their training data is perfect, it often doesn’t reflect real-world data. In one public example, AI industry leader Andrew Ng discussed how, in his project with Stanford Health, the quality of data in his test environment did not match the quality of medical images in the real world, making his AI models useless outside the test environment. This caused the entire project to essentially stall and fail, jeopardizing millions of dollars and years of investment.

Planning for project success

All of these data quality-centric activities can seem overwhelming, which is why the steps are so often skipped. But as noted above, bad data can kill AI projects, and not paying attention to these steps is a main reason entire AI projects fail. This is why organizations are increasingly adopting best-practice methods such as CRISP-DM, agile, and CPMAI to make sure they don’t miss or skip the key data quality steps that help keep AI projects from failing.

The problem of teams pushing ahead without planning for project success is all too common. In fact, the second and third phases of both the CRISP-DM approach and CPMAI are “Data Understanding” and “Data Preparation.” These phases come before any model building begins, which is why they are considered best practice for AI organizations seeking success.

Indeed, had the Stanford medical project adopted CPMAI or something similar, it would have spotted the data quality issues before sinking millions of dollars and years of effort into the work. While it may be comforting to know that even luminaries like Andrew Ng and companies like Google have made serious data quality mistakes, you still don’t want to join that club unnecessarily and let data quality issues derail your AI project.
