OpenAI releases Triton, a programming language for AI workload optimization

OpenAI

OpenAI today released Triton, an open source, Python-like programming language that enables researchers to write highly efficient GPU code for AI workloads. Triton makes it possible to reach peak hardware performance with relatively little effort, OpenAI claims, producing code on par with what an expert could achieve in as few as 25 lines.
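The announcement itself doesn't include code, but for a flavor of the language, here is a minimal vector-addition kernel written in the style of OpenAI's public Triton tutorials (the block size and launch grid below are our own illustrative choices, not taken from the article):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements  # guard the last, partially filled block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)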

Read More

Generating images using text: The dawn of the AI dreamers

People in a hallway

Over the past couple of years, there has been a lot of research and development into creating AI models that can generate images from a given text prompt. You can think of such a model as a personal artist who tries to create an artwork by following your instructions word for word. Now, who wouldn't want a personal Picasso? Since that's impossible, we can settle for the next best thing: an AI Picasso 😃.

Read More

The best 10 software apps in Artificial Intelligence

AI technology

Progressive use of Artificial Intelligence software in the market for the automation of processes, programming, and other purposes has become common. AI-based platforms incorporate sophisticated machine learning algorithms to automate business processes. Automation saves employees a great deal of time and energy, enables organizations to operate more efficiently and profitably, and helps people update their skills and capabilities.

Read More

How to use image preprocessing to improve the accuracy of Tesseract

Previously, in How to get started with Tesseract, I gave you a practical quick-start tutorial on Tesseract using Python. It was a pretty simple overview, but it should help you get started with Tesseract and clear some of the hurdles I faced when I was in your shoes. Now, I'm keen on showing you a few more tricks and things you can do with Tesseract and OpenCV to improve your overall accuracy.

Read More

How you can get started with Tesseract

Setting up your open-source OCR stack from scratch

It's far from a secret that Tesseract is not an all-in-one OCR tool that recognizes all sorts of texts and drawings. In fact, this couldn't be further from the truth. If this was a secret, I've already spoiled it, and it's already too late to go back anyway. So, why not dive deep into Tesseract and share a few tips and tricks that could improve your results?

Really, though, I do appreciate all those developers who contribute to open-source projects without expecting any benefits in return. After all, they provide us with capabilities that are relatively difficult for individuals to achieve, such as creating deep neural networks via TensorFlow without much of the hassle. Who would have thought that machine learning would be as accessible as it is today?

Tesseract, too, helps us accomplish simple OCR tasks with a significant success rate and is completely open source. In this post, I'll try to get you going with Tesseract and hopefully help you clear some of the hurdles you might face while working with it.

Even though there are quite a few options in the field (OCRopus, for example), Tesseract still seems to be the first choice for most free riders. That all adds up if you consider that Tesseract is still being developed by the Google community and has grown steadily over time. However, if you've had experience with Tesseract before, you might have noticed that it is likely to disappoint you when it comes to image pre-processing or custom font training.

Leaving custom font training for a later discussion, for now I'll focus mainly on the basics to get you started with Tesseract. First, let's quickly go over the installation.

macOS

I'll be using Homebrew, a package manager, to install the Tesseract libraries. After installing Homebrew, run the following command.

$ brew install tesseract

Or, you could do the same thing with MacPorts if you wish.

$ sudo port install tesseract

Ubuntu

On Ubuntu, it's quite simple as well.

$ sudo apt-get install tesseract-ocr

Windows

For Windows, you can download the unofficial installer from the official GitHub repository. What a sentence, eh?

How do I know if things are installed correctly?

To verify that Tesseract is installed successfully, open your terminal and type the following.

$ tesseract -v

If you receive a few lines of output similar to the ones below, your Tesseract is installed correctly. Otherwise, you might want to check what has gone wrong, starting with the PATH variable on your system.

tesseract 3.05.01
leptonica-1.74.4
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

Installing a few more libraries

To start with, Tesseract is not a Python library, nor does it have an official Python wrapper. This is where all those golden-hearted developers came in and created the awesome Python wrapper pytesseract for us. We also need to install OpenCV and PIL for manipulating images.

$ pip install pillow
$ pip install pytesseract
$ pip install opencv-python

And that's it! Now you've got Tesseract installed on your computer, ready to work with Python. So what are you waiting for? There's no one holding you back; please go ahead and try it out. Here comes a "but," though.
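Before we get to the "but": as a quick sanity check of the stack installed above, the minimal pytesseract round trip looks like the sketch below. This snippet is ours, not the article's; sample.png stands in for any image of yours.

from PIL import Image
import pytesseract

# sample.png is a hypothetical stand-in for an image in your working directory.
text = pytesseract.image_to_string(Image.open("sample.png"))
print(text)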
As stated in many articles, including the official documentation, Tesseract is likely to fail without image pre-processing.

What is image pre-processing?

For quite some time, I've been a regular visitor of the ImproveQuality page in Tesseract's repository, where they name a few methods you can try to improve your accuracy. Although Tesseract claims to apply various image processing operations internally, that's quite often not enough. Here, I'll apply a few of the things we can do using OpenCV.

Let's define a simple function that takes an image path as input and returns the recognized string as output. I'm gonna get super-creative here and name this function "get_string."

import cv2
import numpy as np
import os
import pytesseract

output_dir = "outputs"  # where processed images are written; adjust as needed

def get_string(img_path, method):
    # 'method' tags the filter combination used; the full script on GitHub Gist defines it
    # Read image using opencv
    img = cv2.imread(img_path)
    # Extract the file name without the file extension
    file_name = os.path.basename(img_path).split('.')[0]
    file_name = file_name.split()[0]
    # Create a directory for outputs
    output_path = os.path.join(output_dir, file_name)
    if not os.path.exists(output_path):
        os.makedirs(output_path)

Rescaling: Tesseract works best on images that are 300 dpi or more. If you're working with images below 300 dpi, you might consider rescaling; otherwise, rescaling may not make the impact you thought it would. I personally prefer to make sure images are at least 300 dpi to begin with, rather than rescaling them later in the pipeline. However, everyone has their own preferences. You do you.

    # Rescale the image, if needed.
    img = cv2.resize(img, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_CUBIC)

Noise removal: Most printed documents experience noise to some extent. Though the causes of this noise may vary, it clearly makes it harder for computers to recognize characters. Noise can be removed by combining a few different techniques, including but not limited to converting the image to grayscale, dilation, erosion, and blurring.

Dilation, erosion, and blurring require a kernel matrix to work with. Simply put, the larger your kernel size, the wider the region each method works on. Again, there is no one kernel size that fits all; you need to play with the numbers and eventually find the right values for your images. A good rule of thumb is to start with small kernels for small fonts, and to experiment with larger kernels for larger fonts.

    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Apply blur to smooth out the edges
    img = cv2.GaussianBlur(img, (5, 5), 0)

As you might notice in the piece of code above, kernels can be specified in two similar ways: by building a NumPy array or by passing the kernel size directly to the function.

Binarization: This is definitely a must. Let's think like a computer for a second. In a cyber-reality where everything eventually boils down to 1's and 0's, converting images to black and white immensely helps Tesseract recognize characters. However, this might fail if the input document lacks contrast or has a slightly darker background.

    # Apply threshold to get image with only b&w (binarization)
    img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

After saving the filtered image in the output directory, we can finish writing our get_string function by passing the processed image to Tesseract with the following lines.
    # Save the filtered image in the output directory
    save_path = os.path.join(output_path, file_name + "_filter_" + str(method) + ".jpg")
    cv2.imwrite(save_path, img)
    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(img, lang="eng")
    return result

Now, you might think that you've got everything ready already — but there's always room for improvement. Wouldn't it be nicer if we had more filtering options? In fact, there are plenty of 'em! I'll be explaining a few of those in my next story.

For those who would like to look at my source code before moving on to the next story, here's my complete code on GitHub Gist. As you may notice, it actually does a bit more than text recognition: it looks up a regular expression in a given region of an image and returns the value.

We'll get there pretty soon, no worries! Meanwhile, best be on with your day and keep on the lookout for better opportunities.
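For completeness, here's what calling the finished function looks like. This usage example is ours, with invoice.jpg as a hypothetical stand-in path:

if __name__ == "__main__":
    # invoice.jpg is a hypothetical input; method=1 tags the filter combination above
    print(get_string("invoice.jpg", method=1))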

Read More

15 free & open-source data resources for your next data science project

Resources and information

A consolidated list of free datasets organized by category, for beginners as well as professionals

Many beginners have entered the field of data science since demand for data scientists surged during the pandemic. Most of the time, they have questions like: Where can I find datasets for machine learning and deep learning projects? Where can I get free datasets for data science? So here I am writing up some useful information for every beginner, starting from the very basics. I hope this article will be helpful to beginners as well as advanced data science professionals who were not familiar with these resources earlier.

Data! Data! Data! It is everywhere, but is it in ready-to-use form? Absolutely not. Before I take you through the list of resources that provide free datasets for almost every type of field, the very first thing to understand is that to apply data science skills, you need a dataset in hand, and in ready-to-use form at that.

How do you find the best datasets for specific machine learning projects? Where do you look? If you got stuck, spent a lot of time searching, and ended up frustrated, then this short, crisp guide is for you, to walk you through some useful resources.

If you have ever worked on your own project from scratch, you are probably familiar with the obstacles encountered during data collection. Data collection is the very first and most important step in getting started on a data science project.

There are a few different ways to get hold of a dataset, and below I am providing the top 15 useful platforms where you can find datasets to get started on your journey in data science.

A complete list of dataset resources:

1. Kaggle → I am sure many of you are already familiar with this platform, as it is famous among data science people for lots of reasons. I put it at the top because I use it most of the time. It is indeed helpful for the data science community, as it has interesting datasets covering almost every domain: health, finance, banking, education, and what not! If you want already-prepared data in a structured form, make it your first choice. Open, accessible data formats are well supported on the platform and are easy to work with regardless of your tools. You can find a variety of file types as well, like CSV, JSON, SQLite, and BigQuery.

2. UCI ML repo → The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms. They currently host 588 open-source datasets as a service to the machine learning community, all viewable through their searchable interface. While doing my master's, I was unable to find any free datasets for one of my projects; one of my faculty suggested this platform, and it worked out great. I found a really good, big UCI dataset to start my project. I put it in second place because the UCI datasets are pretty well organized; you can even search by the type of task you are attempting, like regression, classification, or NLP. (See the short pandas sketch at the end of this piece for loading a UCI dataset directly.)
3. Quandl → A resource that provides free datasets for the data science community. It is designed for professionals and delivers financial, economic, and alternative data to people worldwide. They cover two types of data: time series and tables. You can find interesting datasets for finance and economics here, and in these areas the datasets for machine learning projects are pretty good. All you need to do is type keywords of your interest into the search bar and choose from the listed results. To use a particular dataset, visit its usage tab. Since I cannot cover everything here, please visit the documentation to fully understand how you can meet your data requirements on this platform using its APIs. Other than the APIs, you can also pull any financial data you need directly into your Python IDE using the Quandl Python library; to learn how, please refer to the documentation.

4. Data.gov → This site is maintained by the U.S. government, and only they decide what to put out there as public, free datasets to be used by researchers and data science people like me and you. Here you can find free datasets arranged by categories like agriculture, climate, energy, ocean, local government, maritime, and older adults' health. So if any of the listed categories interests you, visit this platform to get free datasets for machine learning projects. Just visit the site URL and go to the Data tab on top; this will list all the datasets, and you can also search for keywords of interest. While exploring this site, I found some very good resources as well, so don't forget to go through them.

5. Data.gov.in → This site is maintained by the Indian government. They release all sorts of data in almost every domain, like education, finance, and healthcare, so that researchers like us can use it and develop useful projects on top of it. You can use these free datasets for data science projects, of course, and you can also find image datasets for machine learning and deep learning projects here. Some of the interesting datasets shared on this website back projects built for organizations like DRDO and ISRO, so such free datasets are highly sensitive in terms of correct usage. Accessing this site is very simple:
- Enter your keywords and search, e.g., "Education".
- Click on the relevant search result.
- This will take you to the catalog containing that dataset.
- Go through the pages in the catalog to find the right dataset.
- Extract the dataset in the required format.
- The site will ask some basic questions about your intended usage; answer them and save the data.

6. World Bank data → This website works closely with the Bank's regional teams and global practices to gather high-quality statistical data, and they maintain several macro, financial, and sector databases. They don't compromise on the quality or quantity of data in any respect, as the goal of a world without poverty is essential to them. If this option excites you, please refer to the documentation on how to download datasets from the site; it will help you understand how the datasets are distributed across sections like DataBank, Microdata, and the Data Catalog.

7. grouplens → GroupLens is a research lab in the Department of Computer Science at the University of Minnesota.
They have developed some end-to-end data science projects like MovieLens, local geographic information systems, digital libraries, Cyclopath, and BookLens. They also give access to some of the free datasets they have acquired through research and surveys. If any of the listed projects interests you, visit the Datasets tab at the top of the site to see what is available for use and what is not.

8. RBI → The Reserve Bank of India has put some free data out there. If you want to analyze money market operations, payment flows, or the use of banking, then this site is the place to go to find the right dataset for your next data science project. The datasets are organized by collection frequency: daily, weekly, monthly, and so on. This site is really helpful for time series projects.

9. GitHub repo for public datasets → The "awesome public datasets" repository is one I found useful on GitHub. It contains some high-quality free datasets and, not to mention, is very well organized by domain. Do visit this repo, and don't forget to let me know whether you find it useful.

10. FiveThirtyEight → A site that writes interactive articles and makes graphics on topics ranging from politics to sports, economics, culture, and science & health. They provide analytical stories built on a variety of open-source datasets. You can access the free datasets from their site; all you have to do is download the data of your interest. Another option is their GitHub repo, which holds their interesting datasets as well as the code behind their visualizations and interactive stories.

11. Data.world → This site is very handy, not just for data science people but also for non-technical people who just want insights: journalists and business people looking for clear, accurate, fast answers to business questions. Their data catalog organizes the free datasets well for discovery, governance, and easy access. The Finance directory of open datasets on data.world is one example; you may choose some other category of your choice.

12. Google Dataset Search → Unlike the other sites, this is a search engine built for finding free datasets. Google Dataset Search works just like regular Google Search does, based on the keywords provided, except that it matches the keywords against the description of a dataset instead of its content. If a dataset is publicly available, there is a pretty good chance of finding it by entering specific keywords. At the time of its launch, Google Dataset Search covered almost 25 million different free datasets from across the globe.

13. OpenML → OpenML is an open data science platform meant for machine learning research. The platform is pretty neat and clean, with all sections well organized. You can find free datasets from a variety of domains like healthcare, education, climate change, politics, sports, and whatnot. Every dataset has its own page and can be downloaded in multiple formats like CSV, JSON, and XML; just visit the dataset of your choice and pick a format at the top right corner. You can also use the platform to run your own machine learning tasks, take inspiration from others' tasks on the site, and share the models you build so that others can use them.
14. BuzzFeed News → An American news website that features analytical stories. It has open-sourced the datasets, libraries, and tools it uses, along with data, analyses, and some guides for your convenience, on its GitHub repo.

15. National Center for Environmental Information → This one is the best bet if you are looking for data related to weather and environmental conditions. NCEI is the largest repository of environmental data in the world. It has datasets on the climatic and weather conditions across the United States, oceanic data, meteorological data, geophysical data, atmospheric information, and more.

Conclusion

The consolidated list of free dataset resources above starts with very well-known resources and moves toward some underrated ones that are not very popular in the data science community. I tried my best to include the best resources possible, because I know the struggle of not finding the right data; scraping it yourself every time is very time-consuming and tedious work. Almost all the listed data aggregator sites host open datasets.

If you are planning to work on a data science project, I hope this list helps you with your first step of getting the right data. If it helps you in any way, don't forget to comment and let me and others know. And if you think the list will help other people as well, please applaud the article so that it can reach those who need it.

Cheers and best wishes to all of you!
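As promised in the UCI entry above, here is a minimal sketch of pulling a UCI dataset straight into pandas, using the classic Iris file as an example (the URL and column names come from the dataset's public documentation):

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# The raw file ships without a header row, so we supply the column names.
df = pd.read_csv(url, header=None, names=columns)
print(df.head())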

Read More

[Video Highlights] A path into Data Science

Laptop in the dark

Are you interested in getting ahead in data science? In this TalkPython podcast episode, you'll meet Sanyam Bhutani, who studied computer science but found his education didn't prepare him for landing a data-science-focused job. So he set out on his own path of self-education and advancement. Now he's working at an AI startup and ranking high on Kaggle.

Read More

Apriori Algorithm for Association Rule Learning - How to find clear links between Transactions

Apriori association rule learning

Most of you may already be familiar with clustering algorithms such as K-Means, HAC, or DBSCAN. However, clustering is not the only unsupervised way to find similarities between data points. You can also use association rule learning techniques to determine if certain data points (actions) are more likely to occur together. A simple example would be the supermarket shopping basket analysis. If someone is buying ground beef, does it make them more likely to also buy spaghetti? We can answer these types of questions by using the Apriori algorithm.
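The article doesn't name a specific library, so here is a hedged sketch of the spaghetti-and-ground-beef question using mlxtend, one common open-source implementation of Apriori (the toy baskets and thresholds are our own illustrative choices):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [
    ["ground beef", "spaghetti", "tomato sauce"],
    ["ground beef", "spaghetti"],
    ["milk", "bread"],
    ["ground beef", "milk", "spaghetti"],
]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Keep itemsets present in at least half the baskets, then derive rules.
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])

On this toy data, the rule {ground beef} → {spaghetti} comes out with confidence 1.0: every basket containing ground beef also contains spaghetti.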

Read More

Some amazing Python Open-Source projects for Machine Learning in Finance or Trading

Machine learning is needed to raise standards in digitalization, automation, and security, to name a few areas. One of its major applications is fintech: predictive models have been created for various purposes in the finance sector, and the industry has adopted and invested in machine learning for risk assessment, algorithmic trading, and fraud detection. Open-source projects have become extremely popular in almost every field. In this article, we highlight some amazing Python open-source projects for machine learning in financial trading.

Read More

DeepMind’s Reinforcement Learning Framework “Acme”

Acme is a Python-based research framework for reinforcement learning, open-sourced by Google's DeepMind in 2020. It was designed to simplify the development of novel RL agents and accelerate RL research. By its own account, DeepMind, which is spearheading research in reinforcement learning and artificial intelligence, uses Acme on a daily basis.
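To give a feel for the framework's core abstraction, here is a hedged sketch of Acme's actor/environment-loop pattern. The random actor below is our own stand-in for a real learning agent, and the Gym CartPole environment is an illustrative choice, not one taken from the article:

import acme
from acme import core, specs, wrappers
import gym
import numpy as np

class RandomActor(core.Actor):
    # Implements Acme's Actor interface with uniformly random actions.
    def __init__(self, env_spec):
        self._num_actions = env_spec.actions.num_values

    def select_action(self, observation):
        # Cast to int32 to match the discrete action spec's dtype.
        return np.int32(np.random.randint(self._num_actions))

    def observe_first(self, timestep):
        pass  # a learning agent would record the initial observation here

    def observe(self, action, next_timestep):
        pass  # a learning agent would store the transition for replay here

    def update(self, wait=False):
        pass  # a learning agent would take a gradient step here

env = wrappers.GymWrapper(gym.make("CartPole-v0"))
env = wrappers.SinglePrecisionWrapper(env)
actor = RandomActor(specs.make_environment_spec(env))

# EnvironmentLoop wires the actor and environment together and runs episodes.
acme.EnvironmentLoop(env, actor).run(num_episodes=5)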

Read More

ML Pipeline end-to-end solution

After implementing several ML systems and running them in production, I realized there is significant maintenance overhead for monolithic ML apps. ML app code complexity grows exponentially. Data processing and preparation, model training, model serving — these things can look straightforward, but they are not, especially after moving to production.
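One way to keep those stages from tangling together is to make each one an explicit, replaceable step behind a single artifact. As a toy illustration of that idea (not the author's solution; scikit-learn is our own choice here), a Pipeline composes preprocessing and training so that serving code only sees one object:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # data processing and preparation
    ("model", LogisticRegression()),  # model training
])
pipeline.fit(X, y)

# Model serving then deals with one versioned artifact and a single predict().
print(pipeline.predict(X[:3]))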

Read More

Can we study Snowpack with Satellites?

Snowpack metrics are incredibly important to understanding our climate and planning drinking water supplies. In the western United States, roughly 75% of drinking water comes from annual snowmelt, according to the USGS. But while snowpack is extremely important to many parts of the world, measuring it is a difficult task, especially when it comes to understanding how snow is changing over very large swaths of land in the Arctic.

Read More