ML Image Gatherer

A Python CLI app that scrapes images from the web for training image classification ML algorithms.

Summary

ML Image Gatherer was created to fill a need that arose within my Bird Identifier app. While Bird Identifier could attempt to identify photos, it often wasn't very accurate. I needed a way to gather lots of images so I could train the image classification algorithm to be smarter. And so, ML Image Gatherer was born.

The app works by allowing users to pass a search query in through the command line; it then goes to Google Images, scrapes a number of images matching that query, and saves them. Those images can then be vetted and fed to a machine learning algorithm so it can learn to classify images.


Commands

There are two main commands in Image Gatherer: --query and --batch.

Query is the most basic command. It tells the app to go scrape images for a single subject.

Batch allows users to populate a text file with any number of queries; the app will then scrape images for every one of them, using multithreading to process several queries at the same time. This command greatly increases how effective the app is. In Bird Identifier, I wanted to train the algorithm to identify hundreds of different bird species. Instead of entering each query individually, batch lets me simply populate a text file and let the app do all of the hard work.

Options

Each of the main commands has several options that allow the user to tweak the behavior. The options include:

  • --help: lists all of the possible commands and options.
  • --num: determines the number of images scraped. Accepts 1-100, and defaults to 10.
  • --path: determines where the images will be saved.
  • --threads: exclusively for batch. Determines the number of queries that are scraped concurrently.
  • --no-headless: runs the scraper with a visible browser window, allowing the user to watch the scraping in action.
  • --debug: enables debug logging.


How It Works

The app relies on two libraries for most of its functionality: argparse and selenium.

Argparse is a Python standard-library module that makes building CLI apps easy. It handles all of the commands and options passed in through the terminal.
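As a rough illustration, the commands and options described above might be wired up with argparse along these lines. This is my reconstruction, not the app's actual code: the helper name, defaults, and the mutually exclusive grouping of --query and --batch are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the CLI; flag names come from the
    # Options section, everything else is illustrative.
    parser = argparse.ArgumentParser(prog="image-gatherer")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--query", help="scrape images for a single subject")
    group.add_argument("--batch", help="text file with one query per line")
    parser.add_argument("--num", type=int, default=10,
                        choices=range(1, 101), metavar="N",
                        help="number of images to scrape (1-100, default 10)")
    parser.add_argument("--path", default="images", help="output directory")
    parser.add_argument("--threads", type=int, default=4,
                        help="concurrent queries (batch only)")
    parser.add_argument("--no-headless", action="store_true",
                        help="show the browser while scraping")
    parser.add_argument("--debug", action="store_true",
                        help="enable debug logging")
    return parser

args = build_parser().parse_args(["--query", "dog", "--num", "100"])
print(args.query, args.num)  # prints: dog 100
```

argparse also generates the --help listing automatically from the help strings above, which matches the behavior described in the Options section.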

Selenium is a browser automation tool that I chose for web scraping. Because Image Gatherer targets Google for its images, many typical web scraping libraries wouldn't work: they don't execute JavaScript, so dynamic pages never fully render for them. Selenium doesn't have this issue because it drives a normal web browser and can load JavaScript-heavy pages. This makes the app a bit slower than other web scrapers, but it's a necessary evil.

When a user passes a command into the app, like so: py image-gatherer.py --query dog --num 100, all of the CLI arguments are parsed first. This command tells the app to scrape images for a single query, labelled "dog", and to get 100 of them. The arguments are then passed to the web scraper, which starts a Selenium-driven browser that loads Google Images and searches for the query.

Depending on how many images are needed, the scraper scrolls toward the bottom of the page to lazy-load as many results as possible. After the images are loaded, Selenium clicks each one in turn (so we get the higher-resolution version rather than the thumbnail) and records its src attribute. The images are then downloaded to a folder based on the user's --path argument.
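The scroll-click-collect flow might look roughly like the sketch below. The selectors, scroll count, and function names are all assumptions, not the app's real implementation; the one reliable detail is that grid thumbnails tend to be base64 "data:" URIs, while clicking a result swaps in a real http(s) URL for the larger image, so a filter like keep_full_res is needed either way.

```python
from typing import List

def keep_full_res(srcs: List[str]) -> List[str]:
    # Thumbnails in the results grid are usually base64 "data:" URIs;
    # keep only real http(s) URLs, dropping duplicates but preserving order.
    seen = set()
    urls = []
    for src in srcs:
        if src and src.startswith("http") and src not in seen:
            seen.add(src)
            urls.append(src)
    return urls

def scrape(query: str, num: int, headless: bool = True) -> List[str]:
    # Requires selenium plus a Chrome/chromedriver install; imported here
    # so keep_full_res stays usable without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(f"https://www.google.com/search?q={query}&tbm=isch")
        # Scroll a few times to trigger lazy loading of more results.
        for _ in range(5):
            driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight)")
        srcs = []
        for thumb in driver.find_elements(By.CSS_SELECTOR, "img")[:num]:
            thumb.click()  # the preview pane holds the higher-res version
            srcs.extend(img.get_attribute("src")
                        for img in driver.find_elements(By.TAG_NAME, "img"))
        return keep_full_res(srcs)[:num]
    finally:
        driver.quit()
```

The collected URLs would then be downloaded to the --path folder with an ordinary HTTP client.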

If the user selects the batch command, the app first loads all of the queries from the batch file. Then it creates a ThreadPoolExecutor, sized by the --threads option, that runs several queries at the same time, processing them much more quickly than a sequential run would. Images for each query are labelled and saved to a query-specific sub-folder, allowing for easy consumption by an ML algorithm.
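The batch flow could be sketched as below. Function names and the one-query-per-line file format are my assumptions; the write-up mentions both multithreading and a pool executor, and since scraping is I/O-bound (waiting on the browser and on downloads), this sketch assumes a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def load_queries(path: str) -> List[str]:
    # One query per line; blank lines are skipped.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def run_batch(queries: List[str],
              scrape_fn: Callable[[str], List[str]],
              threads: int = 4) -> Dict[str, List[str]]:
    # Threads give real concurrency here despite the GIL, because each
    # worker spends most of its time blocked on browser I/O.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(scrape_fn, queries))
    # Map each query to its scraped results, e.g. for saving each set
    # of images into a query-specific sub-folder.
    return dict(zip(queries, results))
```

Example: run_batch(load_queries("birds.txt"), scrape_fn, threads=8) would scrape up to eight bird species concurrently.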

Once the images are all downloaded, the user needs to vet each of them to ensure they're appropriate for training the machine learning algorithm. Once that's done, they can be easily imported by the ML algorithm for training!


Other Notes

The app has a few other features that are useful, at least for the developer. The --debug option tells the app to save debug information to a log file. In addition, the app takes screenshots of the browser when the scraper encounters errors, which makes debugging much easier, especially in headless mode.
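A minimal sketch of that debug plumbing, assuming hypothetical function names and log/screenshot paths (Selenium's driver.save_screenshot is a real API; everything else here is illustrative):

```python
import logging

def setup_logging(debug: bool, logfile: str = "image-gatherer.log") -> logging.Logger:
    # --debug routes DEBUG-level records to a log file; otherwise only
    # INFO and above are recorded.
    logger = logging.getLogger("image-gatherer")
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

def safe_action(driver, action, logger: logging.Logger):
    # Wrap a scraping step: on failure, log the traceback and save a
    # screenshot so headless-mode errors can be inspected after the fact.
    try:
        return action()
    except Exception:
        logger.exception("scrape step failed; saving screenshot")
        driver.save_screenshot("error.png")  # Selenium WebDriver API
        raise
```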


Wrap Up

Image Gatherer really isn't a complicated app. It was designed to do one task and to do it well. Despite that simplicity, it taught me a lot about many different topics, and it let me add many more features than a more technically convoluted design would have. I would love to add tests to this project to increase its reliability, but the call of a new project is too strong! I've really enjoyed working on this project, and I definitely want to come back to it someday.