Input Devices

object detection
endless challenges of object detection using computer vision
For this week, in line with the explorations for my final project, I decided to look in depth at how to detect objects using computer vision, so my input device is a camera. In week 6 (embedded programming) I had explored several input devices from the samdino project documentation, and I added the ESP32 Cam to those studies. That week I did not dive into object detection, but stayed within basic programming and communication over Wi-Fi to get the board working. So this week my plan was to explore how I can detect objects using a camera, knowing that the main challenge is computational power. Once that is figured out, I will have to use computer vision to translate the detected objects in the image into locations from which I can pick them.

My first attempt used Edge Impulse, a platform for training machine learning models based on neural network classifiers. The first step was to have a dataset that was large enough. Since this was just a test, I did not spend a lot of time curating my dataset. Instead, I used around 40 images of rocks. To begin on the website, one has to create an account and start a new project. Once that is done, a device must be connected. In my case, since the images were on my laptop, I connected the laptop to upload the data.
Next was data acquisition. Here we upload the images and split them into training data and testing data, with the larger share going to training. Here I realized that 40 images is nowhere near enough, but since this was a test, I decided to proceed as is. The data was split into 79% training and 21% testing.
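To make the split concrete, here is a minimal sketch of how a list of image files could be divided into training and testing sets. This is not how Edge Impulse does it internally (the platform handles the split itself); the function name `split_dataset`, the seed and the file names are assumptions made only for illustration.

```python
import random

def split_dataset(items, train_fraction=0.79, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for the rock photos.
images = [f"rock_{i:02d}.jpg" for i in range(40)]
train, test = split_dataset(images)
print(len(train), len(test))  # 32 8
```

With exactly 40 images the closest possible split is 32 training and 8 testing images (80/20), which shows how coarse the percentages become with such a small dataset.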
Once the data is in, you have to create an impulse. The UI of the website is very well done, and following simple tutorials is sufficient to navigate the required steps from input to output. I used a 96x96 pixel input size, keeping the images light in the hope that the ESP32 processor would be powerful enough to handle a machine learning library.
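A rough way to see why a small input size matters on a microcontroller is to compute the size of one uncompressed image buffer. The sketch below is only back-of-the-envelope arithmetic: the 320x240 comparison size is an assumption chosen for illustration, and the real memory use of a model also includes weights and intermediate buffers that are not counted here.

```python
def raw_image_bytes(width, height, channels=3, bytes_per_channel=1):
    """Size in bytes of one uncompressed image buffer."""
    return width * height * channels * bytes_per_channel

small = raw_image_bytes(96, 96)     # 96 * 96 * 3 = 27648 bytes
larger = raw_image_bytes(320, 240)  # 320 * 240 * 3 = 230400 bytes (assumed comparison size)
print(small, larger, round(larger / small, 1))  # 27648 230400 8.3
```

A single 96x96 RGB frame is therefore roughly eight times lighter than a 320x240 one, before any processing has even started.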
Since the field of machine learning is new to me, I had to refer to other sample projects to compare where my data, model and training stand. Again, it was clear that my dataset was too small to give precise readings. It is recommended that a dataset contain at least 100 images. At this point the priority was to move forward and check whether the ESP32 Cam could handle the library and at least give a reading. So, using the same dataset, I exported the trained project as an Arduino library to be included in my code within the Arduino IDE.

Coding and compilation began once I had my library exported, and so did the errors...
Error 1
                compilation terminated.

                exit status 1
                ei-ibrahim2595-project-1-arduino-1.0.1.h: No such file or directory

                This report would have more information with
                "Show verbose output during compilation"
                option enabled in File -> Preferences.

I realized that the name of the exported zip file does not match the library's header file, so I had to look within the folders to find the correct library name that Arduino reads.

Error 2
                Error compiling for board ESP32 Wrover Module.

                This report would have more information with
                "Show verbose output during compilation"
                option enabled in File -> Preferences.
I was unable to resolve this issue even after changing the file names and paths. I figured that since I was using an older version of the Arduino IDE (1.8.12), a newer version might have resolved the character issues. Although I had faced other problems earlier with the latest Arduino IDE release, I decided to update to the latest version (1.8.16), and it worked!
The big moment is here: testing whether the communication works and whether I receive successful readings. It worked (kinda)! After a full day of errors and troubleshooting, I managed to get it to work. That being said, as expected, the dataset was too small to classify reliably, and 96x96 px was too low a resolution for accurate results. Fixing the dataset problem by adding a larger number of images would slow down the processing even further. In fact, I was unable to do live detection; instead, I had to capture an image and process that.

Attempt 2
My second attempt, after the slow and unreliable processing on the ESP32 Cam with Edge Impulse, was to try a much more powerful Raspberry Pi 3 board. This was my first time using it, and I was surprised by how powerful the tool is. Luckily we had a camera and its connecting cable at the lab, so I immediately began setting up the board, downloading the operating system and the required Raspberry Pi documentation. I hooked the board up to a screen, keyboard, mouse and power supply and got to work! I began by writing a quick Python script to test that the camera was working and that I had established a connection with the device.
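            from picamera import PiCamera
            from time import sleep

            camera = PiCamera()
            # Assumed completion of the timer-based test: show the preview for 5 seconds
            camera.start_preview()
            sleep(5)
            camera.stop_preview()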

First, I tested whether the system works, and more specifically whether the camera is functioning and whether I am able to program it via Python on the Raspberry Pi. The test worked, and I was able to receive signals from the camera based on a timer. That being said, the larger challenge was object detection, so I began looking through open computer vision libraries to find a starting point. The added complexity of the project was due to the many dependencies and libraries Python needed to run an object detection program.

Some of the libraries were:
sudo apt-get install build-essential cmake pkg-config
sudo apt-get install libjpeg-dev libtiff5-dev libjasper-dev libpng12-dev
sudo apt-get install libavcodec-dev libavformat-dev libswscale-dev libv4l-dev
sudo apt-get install libxvidcore-dev libx264-dev
sudo apt-get install libgtk2.0-dev libgtk-3-dev
sudo apt-get install libatlas-base-dev gfortran

sudo apt-get install python3-dev
sudo apt-get install python3-pip
pip3 install opencv-python

sudo apt-get install libqtgui4
sudo modprobe bcm2835-v4l2
sudo apt-get install libqt4-test
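
With this many dependencies, it can help to check which Python modules are actually importable before running the detection script. The following is a small sketch under my own assumptions: the helper name `missing_modules` is hypothetical, and the list only covers the modules imported by the script further down, not the system packages installed with apt-get.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Modules imported by the detection script in this section.
required = ["cv2", "numpy", "imutils", "argparse"]
missing = missing_modules(required)
if missing:
    print("Missing Python modules:", ", ".join(missing))
else:
    print("All required Python modules are importable.")
```

Note that imutils does not appear in the install list above; it is normally installed separately with pip3.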
Python Code:
            # How to run (replace <this_script>.py with the name this file is saved under):
            # python <this_script>.py --prototxt MobileNetSSD_deploy.prototxt.txt --model MobileNetSSD_deploy.caffemodel

            # import packages
            from imutils.video import VideoStream
            from imutils.video import FPS
            import numpy as np
            import argparse
            import imutils
            import time
            import cv2

            # construct the argument parse and parse the arguments
            ap = argparse.ArgumentParser()
            ap.add_argument("-p", "--prototxt", required=True,
                help="path to Caffe 'deploy' prototxt file")
            ap.add_argument("-m", "--model", required=True,
                help="path to Caffe pre-trained model")
            ap.add_argument("-c", "--confidence", type=float, default=0.2,
                help="minimum probability to filter weak predictions")
            args = vars(ap.parse_args())

            # Arguments used here:
            # prototxt = MobileNetSSD_deploy.prototxt.txt (required)
            # model = MobileNetSSD_deploy.caffemodel (required)
            # confidence = 0.2 (default)

            # SSD (Single Shot MultiBox Detector) is a popular algorithm in object detection
            # It has no dedicated region proposal network and predicts the bounding boxes and the classes directly from feature maps in one single pass
            # To improve accuracy, SSD uses small convolutional filters to predict object classes and offsets to default bounding boxes
            # MobileNet is a convolutional neural network used to produce high-level features

            # SSD is designed for object detection in real-time
            # SSD object detection consists of 2 parts: extracting feature maps, and applying convolution filters to detect objects

            # Let's start by initialising the list of the 21 class labels MobileNet SSD was trained on.
            # Each prediction consists of a bounding box and 21 scores, one per class (including one extra class for no object),
            # and we pick the highest score as the class for the bounded object
            # Note: "background" must come first, because the model uses index 0 for it
            CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
                    "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
                    "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
                    "sofa", "train", "tvmonitor"]

            # Assigning random colors to each of the classes
            COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

            # COLORS: a list of 21 R,G,B values, like ['101.097383   172.34857188 111.84805346'] for each label
            # length of COLORS = length of CLASSES = 21

            # load our serialized model
            # The model from Caffe: MobileNetSSD_deploy.prototxt.txt; MobileNetSSD_deploy.caffemodel;
            print("[INFO] loading model...")
            net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
            # print(net)

            # initialize the video stream,
            # and initialize the FPS counter
            print("[INFO] starting video stream...")
            vs = VideoStream(src=0).start()
            # warm up the camera for a couple of seconds
            time.sleep(2.0)

            # FPS: used to compute the (approximate) frames per second
            # Start the FPS timer
            fps = FPS().start()

            # OpenCV provides two functions to facilitate image preprocessing for deep learning classification: cv2.dnn.blobFromImage and cv2.dnn.blobFromImages. Here we will use cv2.dnn.blobFromImage
            # These two functions perform: Mean subtraction, Scaling, and optionally channel swapping

            # Mean subtraction is used to help combat illumination changes in the input images in our dataset. We can therefore view mean subtraction as a technique used to aid our Convolutional Neural Networks
            # Before we even begin training our deep neural network, we first compute the average pixel intensity across all images in the training set for each of the Red, Green, and Blue channels.
            # we end up with three variables: mu_R, mu_G, and mu_B (3-tuple consisting of the mean of the Red, Green, and Blue channels)
            # For example, the mean values for the ImageNet training set are R=123.68, G=116.78, and B=103.94
            # When we are ready to pass an image through our network (whether for training or testing), we subtract the mean, \mu, from each input channel of the input image:
            # R = R - mu_R
            # G = G - mu_G
            # B = B - mu_B

            # We may also have a scaling factor, \sigma, which adds in a normalization:
            # R = (R - mu_R) / sigma
            # G = (G - mu_G) / sigma
            # B = (B - mu_B) / sigma

            # The value of \sigma may be the standard deviation across the training set (thereby turning the preprocessing step into a standard score/z-score)
            # sigma may also be manually set (versus calculated) to scale the input image space into a particular range — it really depends on the architecture, how the network was trained

            # cv2.dnn.blobFromImage creates a 4-dimensional blob from an image. It optionally resizes and crops the image from the center, subtracts mean values, scales values by scalefactor, and swaps the Blue and Red channels
            # a blob is just one or more images with the same spatial dimensions (width and height) and the same depth (number of channels), that have all been preprocessed in the same manner

            # Consider the video stream as a series of frames. We capture each frame based on a certain FPS, and loop over each frame
            # loop over the frames from the video stream
            while True:
                # grab the frame from the threaded video stream and resize it to have a maximum width of 400 pixels
                # vs is the VideoStream
                frame = vs.read()
                frame = imutils.resize(frame, width=400)
                print(frame.shape) # (225, 400, 3)
                # grab the frame dimensions and convert it to a blob
                # First 2 values are the h and w of the frame. Here h = 225 and w = 400
                (h, w) = frame.shape[:2]
                # Resize each frame
                resized_image = cv2.resize(frame, (300, 300))
                # Creating the blob
                # The function:
                # blob = cv2.dnn.blobFromImage(image, scalefactor=1.0, size, mean, swapRB=True)
                # image: the input image we want to preprocess before passing it through our deep neural network for classification
                # mean: the value subtracted from every channel before scaling (here 127.5)
                # scalefactor: After we perform mean subtraction we can optionally scale our images by some factor. Default = 1.0
                # scalefactor should be 1/sigma as we're actually multiplying the input channels (after mean subtraction) by scalefactor (Here 1/127.5)
                # swapRB: OpenCV assumes images are in BGR channel order; however, the 'mean' value assumes we are using RGB order.
                # To resolve this discrepancy we can swap the R and B channels in image by setting this value to 'True'
                # In current OpenCV releases swapRB defaults to False, so it is passed explicitly as swapRB=True below.

                blob = cv2.dnn.blobFromImage(resized_image, (1/127.5), (300, 300), 127.5, swapRB=True)
                # print(blob.shape) # (1, 3, 300, 300)
                # pass the blob through the network and obtain the predictions
                net.setInput(blob) # net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
                # Predictions:
                predictions = net.forward()

                # loop over the predictions
                for i in np.arange(0, predictions.shape[2]):
                    # extract the confidence (i.e., probability) associated with the prediction
                    # predictions.shape[2] = 100 here
                    confidence = predictions[0, 0, i, 2]
                    # Filter out predictions below the minimum confidence level
                    # Here, we set the default confidence as 0.2. Anything at or below 0.2 will be filtered
                    if confidence > args["confidence"]:
                        # extract the index of the class label from the 'predictions'
                        # idx is the index of the class label
                        # E.g. for person, idx = 15, for chair, idx = 9, etc.
                        idx = int(predictions[0, 0, i, 1])
                        # then compute the (x, y)-coordinates of the bounding box for the object
                        box = predictions[0, 0, i, 3:7] * np.array([w, h, w, h])
                        # Example, box = [130.9669733   76.75442174 393.03834438 224.03566539]
                        # Convert them to integers: 130 76 393 224
                        (startX, startY, endX, endY) = box.astype("int")

                        # Get the label with the confidence score
                        label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
                        print("Object detected: ", label)
                        # Draw a rectangle across the boundary of the object
                        cv2.rectangle(frame, (startX, startY), (endX, endY),
                            COLORS[idx], 2)
                        y = startY - 15 if startY - 15 > 15 else startY + 15
                        # Put a text outside the rectangular detection
                        cv2.putText(frame, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)

                # show the output frame
                cv2.imshow("Frame", frame)

                # HOW TO STOP THE VIDEOSTREAM?
                # Using cv2.waitKey(1) & 0xFF

                # cv2.waitKey(1) waits 1 ms for a key press and returns -1 when no key is pressed
                # As soon as an event occurs, i.e. when a key is pressed, it returns a 32-bit integer
                # 0xFF represents 11111111, an 8 bit binary
                # since we only require 8 bits to represent a character, we AND the result of waitKey(1) with 0xFF, so an integer from 0 to 255 is always obtained
                # ord(char) returns the ASCII value of the character which would be again maximum 255
                # by comparing the integer to the ord(char) value, we can check for a key pressed event and break the loop
                # ord("q") is 113. So once 'q' is pressed, we can write the code to break the loop
                # Case 1: When no button is pressed: cv2.waitKey(1) is -1; 0xFF = 255; So -1 & 255 gives 255
                # Case 2: When 'q' is pressed: ord("q") is 113; 0xFF = 255; So 113 & 255 gives 113

                # Explaining bitwise AND Operator ('&'):
                # The & operator yields the bitwise AND of its arguments
                # First you convert the numbers to binary and then do a bitwise AND operation
                # For example, (113 & 255):
                # Binary of 113: 01110001
                # Binary of 255: 11111111
                # 113 & 255 = 01110001 (From the right, 1&1 gives 1, 0&1 gives 0, 0&1 gives 0,... etc.)
                # 01110001 is the binary for 113, which will be the output
                # So we will basically get the ord() of the key we press if we do a bitwise AND with 255.
                # ord() returns the unicode code point of the character. For e.g., ord('a') = 97; ord('q') = 113

                # Now, let's code this logic (just 3 lines, lol)
                key = cv2.waitKey(1) & 0xFF

                # Press 'q' key to break the loop
                if key == ord("q"):
                    break

                # update the FPS counter
                fps.update()

            # stop the timer
            fps.stop()

            # Display FPS Information: Total Elapsed time and an approximate FPS over the entire video stream
            print("[INFO] Elapsed Time: {:.2f}".format(fps.elapsed()))
            print("[INFO] Approximate FPS: {:.2f}".format(fps.fps()))

            # Destroy windows and cleanup
            cv2.destroyAllWindows()
            # Stop the video stream
            vs.stop()

Successfully detecting objects in an environment.
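
For the final project, a detection still has to become a location that can be picked. As a minimal sketch, assuming the normalized (startX, startY, endX, endY) box format that the script above scales with the frame width and height, the box can be converted to pixel coordinates and reduced to its centre point. The function name `box_to_pick_point` is hypothetical, and mapping pixels to real-world coordinates would still need a camera calibration that is not covered here.

```python
def box_to_pick_point(norm_box, frame_w, frame_h):
    """Scale a normalized (startX, startY, endX, endY) box to pixels and
    return the pixel box together with its centre point."""
    scale = (frame_w, frame_h, frame_w, frame_h)
    # int() truncates, matching box.astype("int") in the detection script.
    start_x, start_y, end_x, end_y = (int(v * s) for v, s in zip(norm_box, scale))
    centre = ((start_x + end_x) // 2, (start_y + end_y) // 2)
    return (start_x, start_y, end_x, end_y), centre

# A made-up normalized box on a 400x225 frame (the frame size printed by the script).
pixel_box, centre = box_to_pick_point([0.25, 0.25, 0.75, 0.75], 400, 225)
print(pixel_box, centre)  # (100, 56, 300, 168) (200, 112)
```

The centre is only one possible choice of pick point; for irregular objects such as rocks, a point derived from the object's outline might work better.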
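
The preprocessing comments in the script describe mean subtraction followed by scaling. The sketch below reproduces only that arithmetic for single pixel values, using the same mean (127.5) and scale factor (1/127.5) that the script passes to cv2.dnn.blobFromImage. The function name `normalize_pixel` is an assumption for illustration, and the resizing, channel swapping and reshaping that blobFromImage also performs are left out.

```python
def normalize_pixel(value, mean=127.5, scalefactor=1 / 127.5):
    """Apply mean subtraction followed by scaling to one pixel value."""
    return (value - mean) * scalefactor

# The darkest, middle and brightest 8-bit values map to roughly -1, 0 and 1.
for value in (0, 127.5, 255):
    print(value, "->", round(normalize_pixel(value), 3))
```

In other words, with these two parameters the 0 to 255 pixel range is mapped to approximately -1 to 1 before the image is passed to the network.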