How many times have you started off building a complicated analytical SQL query like this? SELECT … . uh??? . . SELECT * FROM … SELECT … And you get stuck trying to figure out exactly what you want to select. You’re thinking about averages, group by’s, the order of your results or some change you want to see over time and the query editor is just sitting there, taunting you, because in SQL, you have to know up front what you want to select into your final results.
If you are a software engineer and you have just enough SQL to write queries that count, sum, average join and maybe sub-select, then I’m writing this post for you. If, when you need more complicated analysis or computation, you pull your query results into excel or your favorite scripting language to do more processing, then I have some good news.
There’s a whole lot more you can do right in SQL and it’s not too bad to learn how to do it.
In this post, I’m going to cover a few concepts that have recently helped me do more computation, better analysis, and it turns out, more efficient querying…and I hope they help you too.
The best upgrade I’ve made to my workflow in the past year was to start keeping an engineering notebook.
Whenever I start a new project, the first thing I do is create a new section in my engineering notebook. It’s really simple. With a tiny script, I generate a dedicated folder for the new project, plus three sub-folders and a README:
some_new_work
|__ notes/
|__ data/
|__ scripts/
|__ README.md
As simple as this structure seems, it has had a tremendous impact on my work. In this post, I want to try to unpack why and how I think that’s working.
In my last post, I tried to tackle getting the Top N Results per Group from a BigQuery dataset.
When I tweeted out the post, I got some great feedback and suggestions for more efficient ways to get the same results, so in this post I want to try to understand why the alternatives are more efficient.
EDIT: After I posted this initially, I got some great feedback, so I wrote a follow-up post here.
In this post, we are going to explore a strategy for collecting the Top N results per Group over a mixed dataset, all in a single query.
I stumbled onto this solution the other day, mostly driven by the fear that I was re-scanning my BigQuery data too often. At the time, the only way I knew how to look at a Top 10 list of a subset of the data was to add a WHERE
clause limiting the whole data set to a single group and combine with ORDER BY
and LIMIT
clauses.
For each group, I would just modify the WHERE
clause, rescan all the data, and get new results. I thought there had to be an easier way to get the same ordered subset for any particular group in the data, all at once.
It turns out, there is a much more efficient way to solve this problem.
It’s easy to blow your BigQuery budget when you are exploring a new data set. Because you’re billed for the amount of data scanned, not the ultimate result set, when you don’t know what you’re looking for, you can end up with wasteful queries.
In this post, I’m going to share some tips for more efficiently scanning data in BigQuery when you don’t quite know what you need.
Last weekend, I had the chance to volunteer at a GoBridge event taught by Bill Kennedy of Ardan Labs. I’m trying to make 2017 my year of learning Go, so helping out at the event felt like a natural extension and a great way to connect with more people in the Go community.
Going in with Ruby as my first language, I braced myself for static typing and wanted concurrent programming to bend my brain, but that’s not really what happened at all.
Over the past couple of years, one thing I’ve become more and more aware of is the unease and uncertainty of diving into a new project. Not matter what the new X is, I find I always go through the same set of uncomfortable feelings on my initial approach.
Now, though, I’m starting to become familiar enough with this process that – even though the discomfort doesn’t go away in the initial learning stages – I can embrace it, coexist with it, and forge ahead in learning, because of the strategies I’m going to lay out here.