<![CDATA[Create videos with Python and become a vlogger]]>https://vasilykorf.com/random-bullet/60ac62d164d1a4cf2dc8cff0Tue, 25 May 2021 02:40:21 GMT

Preface or Naive Market Overview

Video and audio are the dominant content types nowadays. Not because we attend more video calls during the pandemic. Not because the leading mobile apps are becoming video-centric. But because video and audio content engages users for longer, meaning more money from ads – all hail the free market. Since its emergence at the beginning of this century, video blogging has evolved into a new industry with huge potential. Just ask Google how much money social media influencers make.

Call to Arms

How can you take advantage of the vlogging trend without spending time recording and editing videos? Python is the answer. This is where software development comes in and helps automate the routine.

Python is not only for AI, machine learning, data science, data engineering, and web apps. It can also be used to create videos with animations and audio. And this is not about face swaps or deepfake videos.

Facing the Challenge

Generating videos automatically might sound easy, but you still have to tell the computer exactly what to do. Don’t count on producing a video about your trip to Burning Man without feeding it tons of photos – although there are plenty of solutions for generating random terrain and landscapes.

Start simple. Think about the pattern, template, and structure of your upcoming video content. In this case study it's a video dictionary, or wordbook, with translation and visualization.


Project Skeleton

Several things are needed to create a video wordbook with translation and visualization:

  1. Words
  2. Phonetic transcription
  3. Translation
  4. Audio Dubbing
  5. Visualization

Words

Traditional paper dictionaries are becoming obsolete. Even the top nouns obtained from online dictionaries might look plain. The idea was to capture the English corpus actually used on the Internet.

Here comes Reddit, a social media platform and news aggregator with rich data and thematic groups (subreddits). The finance domain was selected to collect text data about investing, stocks, and markets. The wallstreetbets subreddit was best suited for this goal – you might have heard how it fueled GameStop's price surge recently.

All data was collected with PRAW, the official Python Reddit API wrapper, and the Selenium framework.
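
For reference, here is a minimal sketch of pulling post titles from r/wallstreetbets with PRAW; the credentials and user-agent string are placeholders, not the project's actual configuration.

import praw

# placeholder credentials – register an app at reddit.com/prefs/apps to get real ones
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="random-bullet-corpus",
)

# collect titles from the hottest posts to build a word corpus
titles = [post.title for post in reddit.subreddit("wallstreetbets").hot(limit=100)]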

Transcription

IPA is not only a hoppy beer style but also an alphabet. The International Phonetic Alphabet (IPA) represents all the sounds humans produce. The Python package eng_to_ipa can easily convert English text into IPA. Cheers!
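
A quick sketch of the conversion (the printed transcription is approximate):

import eng_to_ipa as ipa

# convert plain English text into an IPA transcription
print(ipa.convert("The stock market is volatile"))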

Translation

There are many free and open-source packages that translate text with bulk translation and automatic language detection. Googletrans was chosen for this project.
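
A minimal usage sketch with googletrans (API details vary a bit between releases; the sample word is just an illustration):

from googletrans import Translator

translator = Translator()
# translate a single word from English to Russian
result = translator.translate("volatile", src="en", dest="ru")
print(result.text)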

Audio Dubbing

The same goes for audio. The gTTS package, a library and CLI tool that interfaces with Google Translate's text-to-speech API, was used to generate audio from text. The pydub package helps edit and manage the resulting audio files.
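
Here is a sketch of how the two packages fit together – generate the English and Russian clips with gTTS, then stitch them with pydub (which requires ffmpeg); the file names and sample word are placeholders.

from gtts import gTTS
from pydub import AudioSegment

# synthesize the word and its translation as separate clips
gTTS("volatile", lang="en").save("word_en.mp3")
gTTS("изменчивый", lang="ru").save("word_ru.mp3")

# concatenate the clips with a half-second pause in between
clip = (
    AudioSegment.from_mp3("word_en.mp3")
    + AudioSegment.silent(duration=500)
    + AudioSegment.from_mp3("word_ru.mp3")
)
clip.export("word_pair.mp3", format="mp3")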

Visualization

As for visualization, it can be done in several ways. We tried parsing stock photo sites and the Flickr API but ended up with Google Photos because its output was more relevant.

The final word

I’m pleased to present Random Bullet – an automated video generator for learning foreign languages.


The videos have been uploaded to YouTube. For now, only English-Russian translation is available. Check them out:

I’d appreciate any feedback or thoughts on this project. Write a comment below or contact me on LinkedIn.

]]>
<![CDATA[Daily Sketches]]>https://vasilykorf.com/do-art-with-code/6025f2f39589f548343a8df3Fri, 12 Feb 2021 03:27:55 GMT

My daily sketches site: https://vasilykorf.com/daily-p5

Motivation

I was inspired by the works of Zach Lieberman and Tim Rodenbröker and their creative use of technology, colors, and shapes. Shout out to Nina Lutz, who shared the site template for daily sketches.

Technology and Workspace

I make the sketches using p5.js: I started off with the basics, refreshed my trigonometry, and explored polar coordinates and ways of drawing simple shapes. It is worth mentioning that we used the same technology in the DataSound project four years ago.

The tools I use are pretty simple: VS Code and the p5.vscode plugin.

Daily Sketches

It’s quite fun to write code and quickly get visual output. p5.js is about as intuitive as it gets, which makes creative coding approachable for everyone.

Do art with code and happy sketching!

And again, check my works here: https://vasilykorf.com/daily-p5

]]>
<![CDATA[Find pizza with AI help]]>https://vasilykorf.com/find-pizza/601b54579589f548343a8d43Thu, 04 Feb 2021 03:16:52 GMT

There are only 124 authentic pizza places out of 1579 in New York. That is what an AI says. This project combines computer vision and machine learning to ease the search for Neapolitan pizza, based on photos from public crowd-sourced reviews.

Check your city! The Neapolitan pizza finder is available at the following link: https://vasilykorf.com/pizza/

Outline

There are many different types of pizza to order: New York-style pizza, Chicago-style deep-dish pizza, Detroit-style square pizza, you name it.

However, there’s one pizza above all others in terms of taste and simplicity: Neapolitan pizza. A classic! Let me google it for you and show some examples:


Guess you got the idea. In other words, a really good, well-made Neapolitan pizza is a unique experience – something you can sit back and enjoy. This type of pizza was invented in Naples, Italy. Italian law insists that Neapolitan pizza must include wheat flour, yeast, mineral water, peeled tomatoes, mozzarella cheese, sea salt, and olive oil. That’s it. So why should you spend time scrolling through photos of pizza from restaurant reviews if computer vision can do it for you?

Research design

I teamed up with my friend Dmitrii Stepakov to build this tool: parsing and labeling the data, defining the tech stack, applying proper machine learning techniques, and deploying a website with predictions and a basic UI.

Dataset

It is a binary classification problem. Since our target is Neapolitan pizza, we have to look at pizza places in Naples and get photos from public reviews. We also need the same number of samples of non-Neapolitan pizza, like trashy pizza from Papa John’s or mediocre pizza around the corner cooked on an electric grill. The Google, Yelp, and Foursquare APIs, plus the ImageNet database, were the main sources.

Source: Hybrid Knowledge Routed Modules for Large-scale Object Detection

It's not as simple as that. Data scientists, not coincidentally, spend a lot of time cleaning, verifying, and organizing data, and this project is no different. After reviewing the first batch of data, we noticed many photos of interiors, visitors, and street signs. That required an additional module – image recognition, or pizza detection in our case.

We ended up having about 6k labeled photos of pizza.

Food Detection with CV

The first step was to identify food in the photos from our dataset. This might sound easy if all you want from the model is labels such as “burger” or “pasta”. However, we were counting on extracting object features for the downstream model, such as dough type and pizza size. We approached the problem from different angles and tried several solutions.

Pizza detection with CV: dataset and predicted labels

First attempt – Histograms of Oriented Gradients (HOG)

Histograms of Oriented Gradients for Human Detection is the scientific paper that inspired us.

The model gives very good results for person detection. Nevertheless, after we used HOG as a feature extractor and trained an SVM to classify images, the result wasn’t promising at all. The implementation shows accuracy between 68% and 75%.
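
A simplified sketch of this baseline, assuming images is a list of equally sized grayscale arrays and labels marks Neapolitan vs. other pizza (both hypothetical variables):

import numpy as np
from skimage.feature import hog
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# extract a HOG descriptor for every photo
features = np.array([
    hog(img, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    for img in images
])

# train an SVM on the descriptors and check hold-out accuracy
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))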


Second attempt – OpenCV

Subsequently, we reframed the problem and tried to segment and isolate the pizza object with OpenCV and an SVM.

There were also many attempts to detect the pizza rim and use it as a classification feature. The initial idea came from this paper: Pizza sauce spread classification using colour vision and support vector machines. Also, here is a great article describing the SVM algorithm for image classification.

However, the result was pretty bad – accuracy close to 60%. Here is an example:

Edge detection with OpenCV
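
For context, an edge map like the one shown above can be produced with a few lines of OpenCV; "pizza.jpg" is a placeholder file name and the thresholds are illustrative.

import cv2

img = cv2.imread("pizza.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# blur first so Canny picks up the pizza rim rather than texture noise
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
cv2.imwrite("pizza_edges.jpg", edges)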

Third time's a charm, right?

Sadly, the attempts above ended in failure – nothing worked as well as we expected, so we had to simplify. The last and most effective attempt was made with a TensorFlow CNN; convolutional neural networks are the natural fit for this problem. TensorFlow has a great dataset, food101, covering 101 food categories. This model shows 82.5% accuracy.
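
A rough sketch of loading food101 through TensorFlow Datasets and training a small CNN on it; the image size, architecture, and epoch count are illustrative, not the project's exact settings.

import tensorflow as tf
import tensorflow_datasets as tfds

(train_ds, val_ds), info = tfds.load(
    "food101", split=["train", "validation"], as_supervised=True, with_info=True
)

def preprocess(image, label):
    # resize and scale pixel values to [0, 1]
    return tf.image.resize(image, (128, 128)) / 255.0, label

train_ds = train_ds.map(preprocess).shuffle(1024).batch(32).prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.map(preprocess).batch(32).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(info.features["label"].num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)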

Pizza detection

For the second part, Neapolitan pizza detection, ResNet50 came to the aid. Broadly speaking, ResNet is a refinement of the plain CNN: it counters the degradation problem in very deep networks by adding shortcut connections between layers. The default ResNet50 classifies photos from ImageNet, so we had to retrain it on our pizza dataset and fine-tune it. Surprisingly, it works.
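
The setup looks roughly like the sketch below – freeze the ImageNet backbone, train a small binary head, then unfreeze and fine-tune with a low learning rate. The directory name and hyperparameters are assumptions for illustration.

import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, input_shape=(256, 256, 3))
base.trainable = False  # freeze the ImageNet backbone first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # Neapolitan vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# assumes photos sorted into class subfolders under ./pizza_dataset
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "pizza_dataset", image_size=(256, 256), batch_size=32, label_mode="binary"
)
model.fit(train_ds, epochs=5)

# fine-tuning: unfreeze the backbone and retrain with a small learning rate
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=3)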

Resnet50 architecture. Source: https://morioh.com/p/dd3ffff216c5

Tech Stack

Project pipeline

To make a long story short, we use:

  • TensorFlow
  • ResNet50
  • Folium
  • OpenStreetMap

Conclusion

The model shows 94.6% accuracy on a test set. Given that, we extrapolated the model to the biggest cities in the US and visualized the results using Folium and OpenStreetMap.
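
Plotting the predictions is a few lines of Folium on top of OpenStreetMap tiles; the coordinates, popup text, and score below are made up for illustration.

import folium

m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)  # New York City
folium.Marker(
    [40.7306, -73.9866],
    popup="Neapolitan pizza (model score 0.97)",
    icon=folium.Icon(color="green"),
).add_to(m)
m.save("pizza_map.html")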

The initial idea holds up. I did real-world testing and ordered pizza from two places in Denver suggested by the model – expectations were fully met.

Check your city – https://vasilykorf.com/pizza/


Share your feedback, bug reports, and new-city requests in the comments below or via email.

And enjoy your meal!


]]>
<![CDATA[Detect underdog stocks to buy during pandemic]]>https://vasilykorf.com/covid-stocks/5fcaf7bcbb634cb48f6ad9dcMon, 07 Dec 2020 19:37:04 GMT

Disclaimer: Investing and trading involve risk. The investing strategy or tactic mentioned in this article is for educational purposes only. Use it at your own risk. And think twice.

Intro


The pandemic is sending shocks through the global economy: the coronavirus has decimated air travel, depressed retail trade, and the impact is being felt by businesses around the world. Boeing (NYSE:BA) is a canonical example of how a stock price reacts to the virus outbreak.


Bet you can name several industries, economic sectors, or companies with the same price pattern just off the top of your head: hotels, casinos, restaurants, entertainment. Nevertheless, there are about 6k companies that trade on the NYSE and Nasdaq and 11k securities, including mutual funds, ETFs, forwards, futures, etc. Pretty wide variety of options. You won’t manually check them all, Python will.

The idea of this post is to programmatically define underdogs like Boeing that look well-positioned to capitalize on recovery. Let’s take a closer look.

Define the COVID pattern

Firstly, we have to simplify the time series from the example above and describe its form: a pre-pandemic state, a huge drop in March, and a plateau of uncertainty with a low-volatility price. You might call this pattern a reverse sigmoid curve (for the nerds), but for simplicity's sake let's name it porebrick, which stands for curbstone in English.

Reverse Sigmoid and Porebrick. Source: researchgate.net, monolit-gbi.ru

Find similar positions

Now that we know what pattern we’re looking for, we have to set a benchmark. One way to do this is to randomly select several stocks from the industries most impacted by COVID (airlines, casinos, hotels) and aggregate them into one feature. Thereafter, we need to compare this aggregated time series with each security on the market and measure the similarity.

Many ML techniques – recommender systems, clustering, NLP – are based on similarity between vectors or entities. There are several similarity measures and distance metrics, such as:

  1. Cosine Similarity
  2. Pearson’s, Spearman’s Correlations
  3. Jaccard similarity
  4. Euclidean Distance
  5. Manhattan Distance

You can find the maths behind these metrics on the Internet; however, they don't scale well to time series with millions of observations. Also, we are dealing with financial data and designing a strategy for a market with many players, so it pays to use an algorithm a little off the beaten path.

Source: www.aaai.org

Here comes Dynamic Time Warping. The algorithm matches similar temporal sequences by taking time shifts into account and is actively used in speech recognition, object tracking, and other domains. It finds an optimal alignment between a pair of time series even if one has some delay or shift. The original DTW has quadratic complexity; luckily, two faster approaches have been developed: FastDTW[1] and UCR-DTW[2]. It's spot-on.

Implementation

Get data

There are many solutions for pulling financial data, since the market for financial APIs is actively growing. I’d like to highlight some of them: pandas_datareader, googlefinance, investpy, FMP (250-request limit).
The code below pulls, normalizes, aggregates, and saves to a data frame the stock prices of random companies from industries impacted by COVID – airlines, casinos, hotels.

# benchmark tickers
airlines = ['BA', 'SAVE', 'JBLU']
hotels = ['MGM', 'MAR', 'H']
casinos = ['LVS', 'WYNN']
covid_pattern_benchmark = airlines + hotels + casinos

# pull data (get_ticket_data and normalize_data are the project's own helpers)
covid_stocks = get_ticket_data(tickets=covid_pattern_benchmark, data_source=source, start=start, end=end)
covid_stocks['total_mean'] = covid_stocks.mean(axis=1)
covid_stocks_norm = covid_stocks.apply(lambda x: normalize_data(x), axis=0)

Apply Dynamic Time Warping

Fortunately, Dynamic Time Warping has been implemented in Python in a package called fastdtw. The following chunk of code iterates over all stocks on the market and measures similarity with our COVID pattern benchmark. The result is a dictionary mapping each ticker to a similarity value.

import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

similarity_dic = dict()
covid_pattern_ts = np.array(covid_stocks_norm['total_mean'])

for ticker in all_stocks_df.columns:
    stock_ts = np.array(all_stocks_df[ticker])

    # measure DTW distance between the benchmark and this stock
    distance, path = fastdtw(covid_pattern_ts, stock_ts, dist=euclidean)
    print(ticker, distance)

    # keep the distance for later ranking
    similarity_dic[ticker] = distance

Sanity check

Now we have a few hundred selected stocks similar to the COVID pattern benchmark. Instead of developing a new performance metric, I randomly picked ten tickers and made this plot:


Have you noticed the porebrick? I have no doubt. I reviewed the selected stocks and found that the DTW approach was able to detect the pattern in other sectors as well: amusement parks, movie theatres, cruise lines, commercial real estate, a Caribbean hotel group. A pretty good result given that we used only technical analysis.

I’ve uploaded a JSON file with the resulting tickers and distance values (the lower, the better).

Rank them out

Empirically proven – all the selected companies show the COVID pattern described above. There are still hundreds of options to buy. Make the first cut by removing industries according to your preference; then you can use a regex to filter out banks or financial organizations by keyword. Apply a no-trade list if applicable.

The next thing to do is sort the stocks by volume and dividend yield. This slightly hedges your investment bets and avoids low-volume stocks (bankruptcy risk).

Here is an example of calling dividend stats with investpy:

import investpy

stock_div = []
for ticket in list(watch_list.keys()):
    print(ticket)
    try:
        # pull the dividend table once per ticker and reuse it
        dividends = investpy.stocks.get_stock_dividends(stock=ticket)
        stock_div.append(
            {
                'ticket': ticket,
                'div': dividends['Dividend'].iloc[0],
                'div_date_update': dividends['Date'].iloc[0],
                'div_type': dividends['Type'].iloc[0],
                'div_yield': dividends['Yield'].iloc[0],
            }
        )
    except Exception:
        # no dividend data available for this ticker
        stock_div.append({'ticket': ticket})

As a result, you end up with dozens of stocks that you can roll into your portfolio.

Call to action

Warren Buffett strongly advises investing in index funds and holding for the long term. I cannot argue with that. After you get the final watchlist of stocks, just find the mutual funds that hold them – an easy way to diversify your portfolio.

You might have noticed that I didn't take November data into account. The idea is to use the November growth rate as a risk-tolerance feature. If you opt for a conservative approach, simply filter out all stocks with positive growth in November.

Having said all that, investing is risky, and further upside is hard to justify, even under optimistic long-term growth assumptions.

Ōmishima Island, Japan, 2019

Epilogue

It shouldn't come as a surprise that no strategy you merely glance through will crack the market. Still, this was an example of Dynamic Time Warping in a finance application, and you can improve it, reinforce it, or transfer it to different domains: churn prediction, customer lifetime value (CLV) prediction, customer behavior segmentation.

I hope you enjoyed and learned something new. It’s your turn now.

Request a Jupyter/Colab/Datalore notebook in the comments if you need it – I'm trying to engage the audience.

Resources and further reading

  1. Using Dynamic Time Warping to Find Patterns in Time Series.
  2. FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space.
  3. UCR-DTW project and journal article.
  4. Video that gives a nice introduction to DTW.
]]>
<![CDATA[Modernisme meets StyleGAN]]>https://vasilykorf.com/doors-stylegan/5fb5df51b974770af99e67fbWed, 25 Nov 2020 07:14:14 GMT

I've spent some time training a StyleGAN2 model on architectural elements – specifically photos of doors (mostly), fences, and windows from Barcelona that represent the Catalan Modernisme movement. Here are some results from the generative adversarial network, along with some experimentation with model interpolation.

Why Modernisme?

Modernisme is a movement in architecture at the juncture of the 19th and 20th centuries, centred in the city of Barcelona – Gaudí is the first name that comes to mind. Loosely speaking, Catalan modernism, or Modernisme for short, is the equivalent of Art Nouveau in France: the natural world was a central inspiration of the movement, with sinuous, organic lines and the use of nature.

Therefore, training a generative adversarial network able to capture the typical elements and main features of the Modernisme movement is a great goal.

If the photo of the Paris Métro entrance rings a bell, you're all set.

Source: artnouveau.pagesperso-orange.fr

First Things First – Dataset

Finding and preparing the dataset is the most challenging part of experimenting with neural networks. For my project, I collected roughly 5k photos using the Flickr API – a relatively small dataset, given that the pre-trained StyleGAN model used about 70k.

An example of the images I used for training:

The sample of data used to train the model

Preprocessing

It is rumored that Data Scientists spend most of their time cleaning and preparing data rather than building models – very true.

My preprocessing pipeline for this dataset consists of the following steps:

  1. Object detection with OpenCV (doors or quadrangles in our case).
  2. Image Alignment using Homography.
  3. Cropping.
  4. Splitting items with a 1:3 aspect ratio (like the narrow doors above) into three square photos.
  5. Image Augmentation, including shifts, shear, and flips.

These simple steps helped me increase the dataset toward the 70k benchmark mentioned above.
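
As a small illustration of step 5, here is a sketch of flip/shift/shear augmentation with Pillow; "door.jpg" and the offsets are placeholders, not the exact pipeline used.

from PIL import Image, ImageChops

img = Image.open("door.jpg")

flipped = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)           # horizontal flip
shifted = ImageChops.offset(img, xoffset=20, yoffset=0)            # horizontal shift (wraps around)
sheared = img.transform(img.size, Image.Transform.AFFINE, (1, 0.2, 0, 0, 1, 0))  # slight shear

for i, variant in enumerate([flipped, shifted, sheared]):
    variant.save(f"door_aug_{i}.png")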

Due to the limitation of my cloud resources – a single GPU (NVIDIA Tesla V100) – I resized my dataset to 256×256.

Basic Door Anatomy

This section is more for non-native English speakers like me to be more familiar with the terminology. The body of any door is essentially made of three parts: stiles, rails, and panels.

Source: salisburyjoinery.com

Training networks

I used the official TensorFlow Implementation of StyleGAN2.

To reduce memory consumption, I decreased the number of channels in the generator and discriminator, latent size, and batch size.

The training process took about 40 hours.

Results

The monochrome cover photo of this post shows the results of the GAN model; here is a set of artificial doors in color.

As you can see, the model catches the proportions, the ratios between rails and panels, and the key elements of doors. There are also organic lines in the patterns, so the model reflects the creativity of Art Nouveau architecture. AI Modernisme, isn't it?

Generated doors

I did interpolation in StyleGAN's latent space with a latent-walk approach to get this GIF.


This is it! Let's call it DoorsStyleGAN. You can download my pre-trained model (.pkl file) from this link. Use it for transfer learning, style mixing, CycleGAN, or any other purpose.

Shoot a comment below if you're in need of a Google Colab notebook implementation. UPDATE: Here it is.

]]>
<![CDATA[DataSound]]>https://vasilykorf.com/datasound-hackathon/5fae0c01b974770af99e67dcFri, 13 Nov 2020 04:32:40 GMT

Generate your own music from financial data. DataSound is a hackathon project; I teamed up with my colleagues to build this tool, designing the concept and setting up the data infrastructure and pipelines.

Can financial data sing? Of course! For example, if your financial data looks like this, with timestamps in the column headings and license id or purchase id along the left side:

Financial data sample

Inspiration

The data above reminds me of a piano roll – a music storage medium used to operate a player piano. Piano rolls have been in production since the end of the 19th century and are still manufactured today. MIDI files mimic the same idea.

Piano Roll

Sound Synthesis

We used product purchase data as the input to our model. Since mainstream music generally consists of seven notes that repeat at the octave, we first had to normalize our time series by binning the values into notes.
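
A toy sketch of that binning step – map a numeric series onto the seven notes of one scale; the sales figures are made up.

import pandas as pd

sales = pd.Series([120, 340, 90, 560, 410, 220, 700, 150])
notes = ["C", "D", "E", "F", "G", "A", "B"]

# cut the value range into seven equal-width bins, one per note
binned = pd.cut(sales, bins=len(notes), labels=notes)
print(list(binned))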

Resampling and Binning of Time Series

Harmony is another building block of music. We snapped our notes to different scales and arranged them into a tonic, predominant, and dominant structure.

Notes harmonisation

Tech stack

  • p5.js
  • tone.js
  • node.js + express.js

Web-interface

Released album on Bandcamp

Teammates

  • V.Grigoriev
  • A.Kireev
  • A.Kotenko
  • S.Kurilov
]]>
<![CDATA[Learning Path to TensorFlow Developer Certification]]>https://vasilykorf.com/path-to-tensorflow-certificate/5fa70e849dfac105ccabf247Sun, 05 Jul 2020 06:52:00 GMT

It is said that the certificate is an official validation confirming your proficiency with TensorFlow with respect to solving deep learning and ML problems in the AI-driven job market. How accurate is this?

Does it make sense to have Tensorflow or any IT certification?

Frankly, not at all. In my view, certifications are not worth it ($100 for the TF certificate) and do not demonstrate your ability. I'm also a bit skeptical about vendor-based certification; it is basically a money grab or an advertisement for the technology. You won’t get a wage increase, promotion, or job offer right after.

There are some features that separate the TF exam from others though. Some of the advantages are:

  • The format of the exam is groovy, nonlinear, and worth preparing for. You’ll write your code in an IDE and submit your models for evaluation with the TF plugin.
  • Each model covers a specific topic – a good way to organize knowledge that you can apply in the industry.
  • The TensorFlow framework is trending. The exam helps you learn the basic concepts and get comfortable with the framework.
  • A feeling of completion – a pleasant extra.

Exam structure – 5 hours, 5 models

In short, you need to develop five models within five hours:

  1. Basic TensorFlow developer skills – create and save a simple neural network, know how to debug.
  2. Machine learning (ML) and deep learning (DL) foundation – build, train, and tune model, apply callbacks, deal with different data.
  3. Image classification problem – image recognition, object detection, CNNs, image augmentation.
  4. Natural Language Processing (NLP) – binary, multi-class categorization, embeddings, LSTMs, RNNs, GRU layers, CNNs.
  5. Time series forecasting – data preparation (trailing and centered windows), metrics understanding, learning rate adjustments, RNNs, CNNs.

Check the official exam page and Candidate Handbook.
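
As a warm-up for the first item on the list above, a minimal build–train–save round trip looks like this (the toy data and epoch count are arbitrary):

import numpy as np
import tensorflow as tf

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=float)
ys = xs + 0.5  # toy linear relationship

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=[1])])
model.compile(optimizer="sgd", loss="mse")
model.fit(xs, ys, epochs=200, verbose=0)

print(model.predict(np.array([10.0])))  # should be close to 10.5
model.save("my_model.h5")  # save the model so it can be reloaded and evaluated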

Hints and advice

  1. Focus on the Tensorflow framework. You don’t need common packages like scikit-learn or Pandas for this exam.
  2. Learn how to use official documentation. You’ll be creating models, therefore, think about TensorFlow datasets, data preparation, input-, output-related methods. Spend some time understanding the structure of TF objects.
  3. Know the ways to tune, tweak, and monitor your TF models.
  4. Switch from Jupyter or Colab to an IDE. Be familiar with your environment and test it; in the case of this exam, it’s PyCharm, the TensorFlow certification plugin bundle, and specific packages. Follow this official guide to set up your environment properly.
  5. If you are new to PyCharm, learn how to use it efficiently, this can save you time during the examination. This minimum knowledge set includes hotkeys to run and modify scripts, debugger, and usage of scientific mode.
  6. Log your learning progress. Dates of completed courses/books/projects could well be enough.

Belgrade, 2019

Resources to prepare yourself for TF exam

If you have experience in developing models in TensorFlow, the exam should be relatively easy for you.

Basis

Laurence Moroney leads AI Advocacy at Google; he is the ideologist of this exam and the co-author of the related courses: “DeepLearning.AI TensorFlow Developer Professional Certificate” (formerly the TensorFlow in Practice Specialization). This specialization consists of four courses that cover almost everything you’ll need to pass the exam. You can also audit these courses for free. Coursera changed its UI/UX, so it might be tricky to find this option: visit one of the courses in the specialization, for instance Introduction to TensorFlow for AI, ML, select the Syllabus section, and audit the content.

Again, if you have been working in a data-related field and have prior Python and TensorFlow experience, just go through the course above and you’ll be fine; otherwise, you’ll have to spend more time on preparation. Specifically, I suggest creating a simple project from scratch for each of the exam models: basic CNN, image recognition, NLP, time series.

Optional Courses

Books

Good luck with that!


As usual, shoot a comment below if you're in need of a general TensorFlow cookbook or if you have any questions.

]]>
<![CDATA[Hosting your blog for peanuts]]>https://vasilykorf.com/hosting-your-blog-for-peanuts/5fade9d4b974770af99e67aeSat, 13 Jun 2020 02:05:00 GMT

#serverless #static #hosting

If you have a blog or a static website, you can combine a bunch of tools to provide a super-fast solution that costs less than a cup of coffee a month.

Hosting is always a compromise between reliability, scalability, pricing, and tech buzz. The good news is that Amazon provides file storage called S3 and, moreover, allows you to host your static website on S3. It is rumoured that static is the new dynamic, so here you go.

I use this approach, combining a few additional tools:

  • S3 is a way to store files in the cloud
  • Route 53 is a DNS service to bridge your domain with your content
  • CloudFront to distribute your site over HTTPS

Prerequisites

  • Buy domain name – Google Domains, GoDaddy, or any other provider
  • Create an AWS Account
  • Setup the AWS CLI

Amazon AWS has tutorials for setting almost everything up. Official instructions for configuring the AWS CLI are available here.


Install the Ghost and run it locally

Follow this tutorial from the Ghost platform. You’ll be able to see your blog at http://localhost:2369

Generate a static version of your blog

You have to generate static assets from your localhost Ghost blog. I found several solutions for doing this:

  • HTTrack – not well maintained, but works well, except that thumbnail images need to be set explicitly.
  • Buster – completely abandoned; I’d rather skip it.
  • Ghost-static-site-generator – my current working solution; it parses your site with one command:
gssg --url https://yourdomain.com

Create S3 bucket

You can create the S3 bucket for your site using the CLI or the AWS Console. Name it yourdomain.com, enable "Static website hosting" in the bucket’s properties, set index.html as the default index document, allow all public access, and set the following bucket policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadForGetBucketObjects",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::www.rezamousavi.com/*"
        }
    ]
}

Also, save your bucket’s website endpoint; you’ll need it later: http://yourdomain.com.s3-website.eu-central-1.amazonaws.com/

Then create a second bucket and name it www.yourdomain.com. This bucket should redirect requests to your first one: set its redirect target to the website endpoint of your first bucket.

Upload our blog static files to the S3 bucket

It’s relatively easy with AWS CLI:

aws s3 sync static_site_folder s3://yourdomain.com --acl public-read --delete

Now, by visiting the URL of your first bucket, you’ll see your static website.

Create CloudFront distribution for our S3 bucket

Create a web distribution using HTTPS with CloudFront. Add your first bucket’s URL in the “Origin Domain Name” field. Redirect HTTP to HTTPS. Set a custom SSL certificate, which you can obtain with AWS Certificate Manager. After the distribution has been deployed, you’ll get a URL like *.cloudfront.net; you can access your website by visiting it.

Update our domain DNS records

The last thing is to route your domain to the CloudFront URL. Unfortunately, Google Domains doesn’t support CNAME-like functionality at the zone apex, although you can have Google Domains forward your root domain to your www CNAME. I go for AWS Route 53 instead.

In the Route 53 console, create a new hosted zone for your domain and an alias record with the Alias Target set to your CloudFront distribution URL.

Additionally, create one more record for www.yourdomain.com and list it as an alternate domain name in the distribution settings.

Update the newly obtained name servers in your domain provider’s console/admin panel.

Important notice

CloudFront caches your S3 files for one day. If you’d like your website to update instantly, you should invalidate the objects in the CloudFront distribution cache after pushing new changes. Watch out for free-tier limits. Learn more at the AWS Knowledge Center.
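
For example, an invalidation can be triggered from Python with boto3 along these lines; the distribution ID is a placeholder.

import time
import boto3

cloudfront = boto3.client("cloudfront")
cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",  # placeholder – use your distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(time.time()),  # must be unique per request
    },
)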

Ready to go. Happy hosting and blogging!

]]>
<![CDATA[Juiced Group in Pandas]]>https://vasilykorf.com/power-pandas-group-by/5fa70c639dfac105ccabf208Wed, 20 May 2020 23:35:08 GMT

It is generally considered that pandas is one of the most popular Python libraries for data science. The first and most important thing is understanding the syntax of the package.

The pandas package has a number of aggregating functions that reduce the dimension of the initial dataset. It ships with a set of SQL-like aggregation functions you can apply when grouping data during the feature-engineering step. Here’s a quick example of how to group on multiple columns and summarise data by applying multiple aggregation functions with pandas.

Create a dataset

import pandas as pd

data = {
    "State": ["Alabama", "Alabama", 
              "Arizona", "Arizona", 
              "California", "California", 
              "Colorado", "Colorado", 
              "Florida", "Florida"],
              
    "City": ["Montgomery", "Birmingham", 
             "Phoenix", "Tucson", 
             "Los Angeles", "Sacramento", 
             "Denver", "Colorado Springs",
             "Tallahassee", "Miami"],
             
    "Population": [198218, 209880, 1660272, 545975, 3990456, 
                   508529, 716492, 472688, 193551, 470914],
                   
    'Real-Estate Tax': [0.42, 0.42, 0.69, 0.69, 0.76, 
                        0.76, 0.53, 0.53, 0.93, 0.93]}
    
df = pd.DataFrame(data)
print(df)

Output:

        State              City  Population  Real-Estate Tax
0     Alabama        Montgomery      198218             0.42
1     Alabama        Birmingham      209880             0.42
2     Arizona           Phoenix     1660272             0.69
3     Arizona            Tucson      545975             0.69
4  California       Los Angeles     3990456             0.76
5  California        Sacramento      508529             0.76
6    Colorado            Denver      716492             0.53
7    Colorado  Colorado Springs      472688             0.53
8     Florida       Tallahassee      193551             0.93
9     Florida             Miami      470914             0.93

Grouping by specific columns with aggregation functions

To group in pandas, use the .groupby() method.

The following code will group by 'State' and 'Real-Estate Tax'. To apply aggregation functions, simply pass key:value pairs as a dictionary to the .agg() method.

I’d recommend setting a specific prefix on the resulting columns to avoid possible duplicates and make your code more coherent.

Don't forget to reset the index – multi-index notation isn't sklearn-friendly.

grouped_df = df.groupby(['State', 'Real-Estate Tax']).agg({'Population': ['mean', 'min', 'max']})
grouped_df = grouped_df.add_prefix('population_')
grouped_df = grouped_df.reset_index()
print(grouped_df)

The full list of aggregation functions

mean(): Compute mean of groups
sum(): Compute sum of group values
size(): Compute group sizes
count(): Compute count of group
std(): Standard deviation of groups
var(): Compute variance of groups
sem(): Standard error of the mean of groups
describe(): Generates descriptive statistics
first(): Compute first of group values
last(): Compute last of group values
nth() : Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values

In Conclusion

In light of the above, use the pandas group-by method and apply as many aggregation functions as you find useful; you can easily drop highly correlated columns afterwards (see the sketch below).
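
A sketch of that clean-up step, with an arbitrary 0.95 correlation threshold:

import numpy as np

numeric = grouped_df.select_dtypes(include="number")
corr = numeric.corr().abs()

# keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
grouped_df = grouped_df.drop(columns=to_drop)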

If you are interested in another example of group by, check this guide on custom aggregation functions for pandas.

]]>
<![CDATA[Highlighting code in Ghost]]>https://vasilykorf.com/highlighting-code-in-ghost/5fa70c639dfac105ccabf20aFri, 15 May 2020 22:03:25 GMT

With Ghost you can easily embed code in posts. However, with Markdown you face a lack of code formatting and get code blocks without syntax highlighting. After a bit of research I opted for Prism – a syntax highlighting engine.

# Python
import pandas as pd
df = pd.DataFrame()
Out of the box Ghost code block

Adding Prism Syntax Highlighting to your blog

Code Injection is the simplest way to add Prism to your theme.

Add the following lines of code to the Site Header section:

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/themes/prism-okaidia.min.css" integrity="sha256-Ykz0nNWK7w4QWJUYR7OraN4773aMB/11aMt1nZyrhuQ=" crossorigin="anonymous" />

    <style type="text/css" media="screen">
        .post-full-content pre strong {
            color: white;
        }
        .post-full-content pre {
            line-height: 1;
        }
        .post-full-content pre code {
            white-space: pre-wrap;
            hyphens: auto;
            line-height: 0.7;
            font-size: 0.7em;
        }
    </style>

Another injection to the Site Footer:

<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-markup-templating.min.js" integrity="sha256-41PtHfb57czcvRtAYtUhYcSaLDZ3ahSDmVZarE0uWPo=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-javascript.min.js" integrity="sha256-KxieZ8/m0L2wDwOE1+F76U3TMFw4wc55EzHvzTC6Ej8=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-css.min.js" integrity="sha256-49Y45o2obU1Yv4zkYDpMDyAa+D9sgKNbNy4ZYGRl/ls=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-sql.min.js" integrity="sha256-zgHnuWPEbzVKrT72LUtMObJgbwkv0VESwRfz7jpdsq0=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-sass.min.js" integrity="sha256-3oigyyaPovKMS9Ktg4ahAD1R6fOSMGASuA03DT8IrvU=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-python.min.js" integrity="sha256-zXSwQE9cCZ8HHjjOoy6sDGyl5/3i2VFAxU8XxJWfhC0=" crossorigin="anonymous"></script>

Find the link to the specific language component you need and list it in your injection: https://cdnjs.com/libraries/prism.

And then it looks like this:

# Python
import pandas as pd  
df = pd.DataFrame()  
Set Pandas dataframe
]]>
<![CDATA[Flickr integration]]>https://vasilykorf.com/flickr-integration/5fa70c639dfac105ccabf207Fri, 15 May 2020 05:04:32 GMT

The Ghost platform supports Flickr via oEmbed integration. Simply copy the URL of the photo and paste it into the editor.

]]>
<![CDATA[Pacific Northwest]]>The way I see Seattle and Portland.

]]>
https://vasilykorf.com/pacific-northwest/5fa9ce6e11e43407066549edSun, 09 Feb 2020 00:00:00 GMT

The way I see Seattle and Portland.

]]>
<![CDATA[Fukuoka – Tokyo bike ride]]>https://vasilykorf.com/fukuoka-tokyo-bike-touring/5fa9c76c11e43407066549b5Sat, 09 Nov 2019 00:00:00 GMT

Completed the 1,300 km Fukuoka-to-Tokyo route by bicycle on a 30-year-old Japanese frame.

I rebuilt a late-80s Kuwahara Pacer for this trip. It was fun – check the video below.

Strava Route

]]>
<![CDATA[Trip to Morocco]]>Highlights from my winter trip to Morocco. Tangier, Chefchaouen, Fes, Rabat, Marrakesh, and more.

]]>
https://vasilykorf.com/morocco-2018/5fa9cc1b11e43407066549d7Wed, 09 Jan 2019 00:00:00 GMT

Highlights from my winter trip to Morocco. Tangier, Chefchaouen, Fes, Rabat, Marrakesh, and more.

]]>
<![CDATA[Hosting a Ghost blog on AWS]]>https://vasilykorf.com/hosting-ghost-on-aws/5fa70e849dfac105ccabf237Tue, 15 May 2018 04:26:00 GMT

The last time I had a blog, back in the ’00s, I ran it on WordPress. However, after starting to work full-time in industry, I hardly had the time to maintain a personal blog. After COVID hit, the idea of running a new personal blog took root in my mind.

This blog intends to sum up and share how I see the world and what I have learned in the technology space, especially around AI, deep learning, machine learning, data science, data mining, statistics – or whatever it's fashionably called now.

There are many ways to post things online, but having my own medium has several pros:

  • Full control over both the visibility of my content and the design
  • Stay up-to-date on modern frontend technologies and trends
  • Have a single place where I can collect my content, structure my thoughts, and share my work
]]>